Hi,
On 04-06-18 09:16, Hans de Goede wrote:
> Hi,
>
> Note I've dropped the fedora-devel list (-ETOOMUCHBIKESHED)
> and added Javier and Jan to the Cc.
Ugh, so clearly I failed to remove fedora-devel from the CC.
Ah well. I hope this mistake shows that there is nothing
nefarious going on here and that Javier, Peter and I are
really just working on trying to make the boot experience
nicer for Workstation users, while at the same time very
thoroughly keeping in mind the rescue / things broke
scenario.
Regards,
Hans
> On 01-06-18 20:03, Peter Jones wrote:
>> On Thu, May 31, 2018 at 05:47:36PM +0200, Hans de Goede wrote:
>>> Hi,
>>>
>>> On 31-05-18 15:20, Robert Marcano wrote:
>>>> On 05/31/2018 06:52 AM, Hans de Goede wrote:
>>>>> ...
>>>>> This will basically get us back the F28 behavior of showing the
>>>>> menu but only after a failed boot, I think that is a good
>>>>> solution, do you agree?
>>>>
>>>> What is the definition of a successful boot? I ask because a machine
>>>> could boot perfectly, and when you try to interact with it on the
>>>> login screen, bugs on the display driver can change the screen to
>>>> garbage (I have seen this kind of bug a long time ago), or lock up. So,
>>>> the user will be unable to activate any kind of restart with menu
>>>> enabled in order to try an older kernel, or boot to rescue mode.
>>>>
>>>> I think instead of only detecting a successful boot, a machine that
>>>> wasn't properly shut down should enable the menu.
>>>
>>> A broken install may still shut down properly after the user presses
>>> the power button and/or tries ctrl+alt+del.
>>>
>>> But this is an interesting suggestion, I think we should track both
>>> separately, so successful shutdown and successful boot and show the
>>> menu if either one is not true. That should make the chance of not
>>> being able to get the menu a lot smaller.
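A minimal sketch of the "track both" rule above, in Python purely for
illustration. The two flags are hypothetical names for whatever state ends
up being stored; nothing here is actual grub code.

```python
def show_menu(boot_ok: bool, shutdown_ok: bool) -> bool:
    """Show the menu unless BOTH the last boot and the last shutdown succeeded.

    boot_ok and shutdown_ok are hypothetical flags; how they would be
    recorded is not decided in this thread.
    """
    return not (boot_ok and shutdown_ok)
```

With this rule a clean shutdown after a broken boot still brings up the
menu, and so does a good boot that ended in a hard power-off.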
>>
>> In my mind, the mechanism here looks like what I've sketched out below,
>> and I think it encapsulates the above as well as most of what I've seen
>> on this thread already.
>>
>> The workflow is something like this:
>>
>> - user updates the OS[0]
>> - we automatically set the new OS to be booted /once/.
>
> Hmm, I see you also refer to atomic and there this makes sense, but
> in the traditional distro model how would we implement this?
>
> We could implement booting a new kernel once, but since an xserver /
> mesa / gnome update might break things just as easily as a kernel
> update can break things I'm not sure if adding boot-once functionality
> to the traditional model is really helpful.
>
> Reverting to the old kernel might help in some cases, but we are
> also going to get false-positives. I've a feeling this is going to
> become really messy. As such I don't think this is a change we
> can "sell" easily. Some people really don't seem to like the idea of
> any changes to the grub config / menu at all.
>
> I've a feeling that selling the hidden menu by itself is enough
> of a hassle without adding in booting a new kernel once to test it.
> I realize that this is, in a way, a means of lessening the impact of the
> menu being hidden, but I'm not 100% sold on this.
>
> I would rather just show the menu after a failed boot and have
> reverting to the old kernel be a conscious choice of the user. I have
> a number of reasons for this:
>
> 1) Don't revert to older kernel on false-positive failed boot detects
> (limit the result of a false-positive failed boot detect to showing
> the menu without any side effects)
>
> 2) Updates typically come in batches and the boot failure may well be
> caused by something else, so we're not necessarily helping the user
> here; even if the user manages to fix things, they will now be running
> an older kernel for no good reason.
>
> 3) Since reverting to the old kernel may not be enough, we still need
> to show the menu after a failed boot
>
> 4) Principle of least surprise, we are now making unrequested changes to
> the user's system and not (really) notifying the user of this.
> For Atomic I envision that after switching back to the old snapshot /
> release the UI will show a dialog after login along the lines of:
> "The new 20190214 release did not work, we've reverted your machine
> to the 20190207 release" (but then better worded). We could do
> something similar for the kernel, assuming reverting to the old
> kernel will allow us to show the dialog, but we again have the whole
> false positive thing, so now we end up showing a scary dialog because
> of a false-positive failed-boot detect.
>
> So all in all I'm not a big fan of the boot once concept for the
> traditional Fedora version. I think it makes a lot of sense for Atomic
> and we should do it there, but not for Fedora.
>
> Another thing to keep in mind is that we don't really have much time
> to get things in place for F29, so especially for F29 this seems
> too complex and I would prefer to only add a "GRUB_AUTO_HIDE"
> option to /etc/default/grub which when set will make grub2-mkconfig
> generate a grub.cfg which hides the menu unless a failed boot
> is detected and not make any changes wrt which kernel to boot when.
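To make the proposed option concrete, here is a rough Python sketch of how
a generator could read it. This is not grub2-mkconfig code: the accepted
values ("1", "true", "yes") and the "hidden" / "menu" result strings are
assumptions for illustration; only the GRUB_AUTO_HIDE name and the
/etc/default/grub location come from the paragraph above.

```python
import shlex


def auto_hide_enabled(defaults_text: str) -> bool:
    """Return True if GRUB_AUTO_HIDE is set to a truthy value.

    defaults_text is the content of a shell-style KEY=value file such as
    /etc/default/grub. The set of truthy values is an assumption.
    """
    value = ""
    for raw in defaults_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, rest = line.partition("=")
        if key.strip() == "GRUB_AUTO_HIDE":
            # shlex strips quotes and trailing "# comments"; last match wins.
            parts = shlex.split(rest, comments=True)
            value = parts[0] if parts else ""
    return value.lower() in {"1", "true", "yes"}


def menu_mode(defaults_text: str, last_boot_failed: bool) -> str:
    """Hide the menu only when auto-hide is on and the last boot was fine."""
    if auto_hide_enabled(defaults_text) and not last_boot_failed:
        return "hidden"
    return "menu"
```

Note that the sketch never changes which kernel gets booted; it only
decides whether the menu is visible.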
>
> This also has the added advantage that it avoids me touching the
> default selection code, which would collide with Javier's BLS work I think.
>
> Regards,
>
> Hans
>
>
>
>> - we have a successful-boot-test.service that depends on [getty.target
>> or graphical.target]. Upon starting, it sets a timer for some
>> relatively long amount of time, like say 5 minutes, and at the end of
>> that time it decides if booting worked and sets some state to let us
>> know.
>> - we also provide a tool for an admin to set a specific state, since
>> they know best.
>> - if a user logs in and starts doing stuff before the timer expires,
>> we booted successfully, and we set the new OS to be default and mark
>> it as having succeeded.
>> - if the machine is rebooted *unexpectedly*[1] without any successful
>> login before the timer expires, we reboot and get the previous OS, and
>> we can detect that it failed during that boot and take whatever
>> appropriate action
>> - if the timer expires without user activity, or if there's an
>> expected intermediate reboot we need to do, it's indeterminate if it
>> worked or not; we set the one-shot again[5].
>> - in the case where it's an expected reboot, we re-set the count of
>> how many times we've reached the indeterminate state
>> - otherwise we add one to the count
>> - if the count is above some threshold (say 3) in some amount of time
>> (say a day), set a one-shot variable that says to show the menu.
>> - on server[2] we're going to want some indicator of "is successfully
>> doing its job" instead of login; that's probably a separate
>> feature.
>> - It probably is worth having the power button be an indicator of how
>> we shut down, and make that be a reason to show the menu, at least
>> in some cases, if you haven't done things like gone into settings
>> and told the power button to do nothing.
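The bookkeeping in the list above could be sketched roughly as follows.
This is Python for illustration only; every name (BootState, Outcome,
record, the field names) is made up here, the threshold of 3 is the "say 3"
from the list, and the "in some amount of time (say a day)" window is left
out to keep the sketch short.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    LOGIN_BEFORE_TIMEOUT = "login"        # user logged in before the timer expired
    UNEXPECTED_REBOOT = "unexpected"      # rebooted with no successful login
    EXPECTED_REBOOT = "expected"          # e.g. reboot after a relabel
    TIMER_EXPIRED = "expired"             # timer ran out without user activity


@dataclass
class BootState:
    default_os: str
    oneshot_os: Optional[str] = None      # new OS set to be booted once
    indeterminate_count: int = 0
    show_menu_once: bool = False
    last_result: str = "unknown"


THRESHOLD = 3  # assumed value, "say 3" in the list above


def record(state: BootState, outcome: Outcome) -> BootState:
    """Update the (hypothetical) persistent state after one boot attempt."""
    if outcome is Outcome.LOGIN_BEFORE_TIMEOUT:
        # Success: promote the one-shot entry to default.
        if state.oneshot_os is not None:
            state.default_os = state.oneshot_os
        state.oneshot_os = None
        state.last_result = "success"
    elif outcome is Outcome.UNEXPECTED_REBOOT:
        # Failure: the one-shot is consumed, so the previous OS boots next.
        state.oneshot_os = None
        state.last_result = "failed"
    else:
        # Indeterminate: leave the one-shot armed for another try.
        state.last_result = "indeterminate"
        if outcome is Outcome.EXPECTED_REBOOT:
            state.indeterminate_count = 0
        else:
            state.indeterminate_count += 1
        if state.indeterminate_count > THRESHOLD:
            state.show_menu_once = True
    return state
```

The admin tool mentioned in the list would then just be a way to write
last_result (and friends) by hand.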
>>
>> And then concerning the actual menu+countdown (or more importantly, when
>> to probe for the keyboard), we don't show the menu or probe for key
>> state unless one of the following is the case:
>>
>> - a persistent grub environment variable that says /not/ to show the
>> menu is /absent/ or set to false. (i.e. the user or some install
>> class[3] disabled this feature, or if grubenv has been corrupted, or
>> if we're on an architecture that insists on not having nice things[4],
>> etc.)
>> - a one-shot grub environment variable, that says to show the menu, is
>> set to true. (i.e. user asked for the menu when they rebooted the
>> machine)
>> - indeterminate boot count is > 1
>> - the previous boot is not marked as indeterminate or success
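The four conditions above boil down to one predicate. A rough Python
sketch, treating the grub environment as a plain string-to-string mapping;
the variable names (hide_menu, show_menu_once, indeterminate_count,
last_boot) are hypothetical, and treating a corrupt counter as "show the
menu" is an assumption in the spirit of the corrupted-grubenv case.

```python
from typing import Mapping, Optional


def _is_true(value: Optional[str]) -> bool:
    """Interpret a grubenv-style string as a boolean (absent means false)."""
    return value is not None and value.strip().lower() in {"1", "true"}


def should_show_menu(env: Mapping[str, str]) -> bool:
    """Return True if the menu (and keyboard probing) should be enabled."""
    # 1. The persistent "do not show the menu" variable is absent or false.
    if not _is_true(env.get("hide_menu")):
        return True
    # 2. The one-shot "show the menu" variable is set.
    if _is_true(env.get("show_menu_once")):
        return True
    # 3. The indeterminate boot count is > 1 (unreadable counts show the menu).
    try:
        count = int(env.get("indeterminate_count", "0"))
    except ValueError:
        return True
    if count > 1:
        return True
    # 4. The previous boot is not marked as indeterminate or success.
    return env.get("last_boot") not in {"indeterminate", "success"}
```

An empty or unreadable environment falls through the first check, so the
default is always to show the menu.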
>>
>> [ 0] I'm being deliberately vague here because I think I mean "updates
>> stuff that runs between (inclusively) the bootloader and
>> [getty.target, graphical.target]" for the traditional OS, and not
>> exactly the same criteria for Atomic, but both can reasonably be
>> captured in one description.
>> [ 1] There are cases like if we do an SELinux relabel during boot and
>> then reboot the machine, or other situations analogous to that,
>> where the reboot is known to be unrelated to the success or failure
>> of the update.
>> [ 2] We could reasonably ship this enabled on workstation+desktop+laptop
>> environments with servers disabled until there's some less
>> wishy-washy description here. Despite what mattdm said above in
>> this thread, I think ultimately we do want it on server, even
>> though we care less about flicker-free booting there - the
>> countdown and probing aren't an insignificant chunk of the boot
>> time, and the time it takes to reboot can come to dominate
>> downtime.
>> [ 3] See [2].
>> [ 4] As a for-instance, IBM ppc* machines nerf out the block device
>> write() call in their firmware, so we don't have one-shot variables
>> there at all and can't do any of this.
>> [ 5] I might be able to be convinced there's a case for local config
>> policy to be injected here, but I think the tool mentioned earlier
>> is probably enough.
>>
>> Now you all get to tell me all the ways I'm wrong ;)
>>