On 06/01/2018 02:03 PM, Peter Jones wrote:
On Thu, May 31, 2018 at 05:47:36PM +0200, Hans de Goede wrote:
> On 31-05-18 15:20, Robert Marcano wrote:
>> On 05/31/2018 06:52 AM, Hans de Goede wrote:
>>> This will basically get us back the F28 behavior of showing the
>>> menu but only after a failed boot, I think that is a good
>>> solution, do you agree?
>> What is the definition of a successful boot? I ask because a machine
>> could boot perfectly, and when you try to interact with it on the
>> login screen, bugs in the display driver can change the screen to
>> garbage (I have seen this kind of bug a long time ago), or lock up. So,
>> the user will be unable to activate any kind of restart with menu
>> enabled in order to try an older kernel, or boot to rescue mode.
>> I think instead of only detecting a successful boot, a machine that
>> wasn't properly shut down should enable the menu.
> A broken install may still shut down properly after the user presses
> the power button and/or tries ctrl+alt+del.
> But this is an interesting suggestion, I think we should track both
> separately, so successful shutdown and successful boot and show the
> menu if either one is not true. That should make the chance of not
> being able to get the menu a lot smaller.
In my mind, the mechanism here looks like what I've sketched out below,
and I think it encapsulates the above as well as most of what I've seen
on this thread already.
The workflow is something like this:
- user updates the OS
- we automatically set the new OS to be booted /once/.
- we have a successful-boot-test.service that depends on [getty.target
or graphical.target]. Upon starting, it sets a timer for some
relatively long amount of time, like say 5 minutes, and at the end of
that time it decides if booting worked and sets some state to let us
- we also provide a tool for an admin to set a specific state, since
they know best.
- if a user logs in and starts doing stuff before the timer expires,
we booted successfully, and we set the new OS to be default and mark
it as having succeeded.
- if the machine is rebooted *unexpectedly* without any successful
login before the timer expires, we reboot and get the previous OS, and
we can detect that it failed during that boot and take whatever
- if the timer expires without user activity, or if there's an
expected intermediate reboot we need to do, it's indeterminate if it
worked or not; we set the one-shot again.
- in the case where it's an expected reboot, we re-set the count of
how many times we've reached the indeterminate state
- otherwise we add one to the count
- if the count is above some threshold (say 3) in some amount of time
(say a day), set a one-shot variable that says to show the menu.
- on server we're going to want some indicator of "is successfully
doing its job" instead of login; that's probably a separate
- It probably is worth having the power button be an indicator of how
we shut down, and make that be a reason to show the menu, at least
in some cases, if you haven't done things like gone into settings
and told the power button to do nothing.
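
As a rough sketch of the workflow above, written in Python purely for
illustration: every name here (BootState, classify, record, the state
fields) is hypothetical and is not existing Fedora, systemd or GRUB code.
It assumes the threshold of 3 and the one-day window mentioned as examples
in the list, and that "above" means strictly greater than.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

THRESHOLD = 3                  # "say 3" in the list above
WINDOW_SECONDS = 24 * 60 * 60  # "say a day"


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    INDETERMINATE = "indeterminate"


@dataclass
class BootState:
    # Hypothetical persistent state; a real version would live in grubenv
    # or similar, which this sketch does not model.
    new_os_is_default: bool = False
    boot_new_os_once: bool = True   # set right after the update
    show_menu_once: bool = False
    indeterminate_times: List[float] = field(default_factory=list)


def classify(logged_in: bool, unexpected_reboot: bool) -> Outcome:
    """Classify a boot from what happened before the timer ran out."""
    if logged_in:
        return Outcome.SUCCESS
    if unexpected_reboot:
        return Outcome.FAILURE
    # Timer expired with no user activity, or an expected intermediate reboot.
    return Outcome.INDETERMINATE


def record(state: BootState, outcome: Outcome, now: float,
           expected_reboot: bool = False) -> BootState:
    """Update the state for one boot; `now` is a timestamp in seconds."""
    if outcome is Outcome.SUCCESS:
        state.new_os_is_default = True
        state.boot_new_os_once = False
    elif outcome is Outcome.FAILURE:
        # The one-shot was consumed, so the next boot gets the previous OS.
        state.boot_new_os_once = False
    else:
        state.boot_new_os_once = True  # set the one-shot again
        if expected_reboot:
            state.indeterminate_times.clear()   # re-set the count
        else:
            state.indeterminate_times.append(now)  # add one to the count
        # Only indeterminate boots inside the time window are counted.
        state.indeterminate_times = [
            t for t in state.indeterminate_times if now - t <= WINDOW_SECONDS
        ]
        if len(state.indeterminate_times) > THRESHOLD:
            state.show_menu_once = True
    return state
```

Keeping timestamps rather than a bare counter is just one way to get the
"in some amount of time" part; a counter plus a last-reset time would work
as well.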
And then concerning the actual menu+countdown (or more importantly, when
to probe for the keyboard), we don't show the menu or probe for key
state unless one of the following is the case:
- a persistent grub environment variable that says /not/ to show the
menu is /absent/ or set to false. (i.e. the user or some install
class disabled this feature, or if grubenv has been corrupted, or
if we're on an architecture that insists on not having nice things,
- a one-shot grub environment variable, that says to show the menu, is
set to true. (i.e. user asked for the menu when they rebooted the
- indeterminate boot count is > 1
- the previous boot is not marked as indeterminate or success
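
The four conditions above could be checked roughly like this. It is only a
sketch in Python: the variable names (hide_menu, show_menu_once,
indeterminate_count, previous_boot) are made up for the example and are
not actual grubenv keys, and treating an unparsable count as a reason to
show the menu is my assumption, in the spirit of the "grubenv has been
corrupted" case.

```python
from typing import Mapping, Optional


def _is_true(value: Optional[str]) -> bool:
    """Treat only an explicit true-ish string as true; absent means false."""
    return value is not None and value.strip().lower() in ("1", "true", "yes")


def should_show_menu(env: Mapping[str, str]) -> bool:
    """Return True if the menu (and keyboard probing) should happen."""
    # 1. The persistent "do not show the menu" variable is absent or false.
    if not _is_true(env.get("hide_menu")):
        return True
    # 2. The one-shot "show the menu" variable is set to true.
    if _is_true(env.get("show_menu_once")):
        return True
    # 3. The indeterminate boot count is > 1.
    try:
        count = int(env.get("indeterminate_count", "0"))
    except ValueError:
        return True  # assumption: unreadable state is treated as suspicious
    if count > 1:
        return True
    # 4. The previous boot is not marked as indeterminate or success.
    if env.get("previous_boot") not in ("indeterminate", "success"):
        return True
    return False
```

For example, should_show_menu({}) is true because the hide variable is
absent, while an environment with hide_menu=true and previous_boot=success
gives false.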
[ 0] I'm being deliberately vague here because I think I mean "updates
stuff that runs between (inclusively) the bootloader and
[getty.target, graphical.target]" for the traditional OS, and not
exactly the same criteria for Atomic, but both can reasonably be
captured in one description.
[ 1] There are cases like if we do an SELinux relabel during boot and
then reboot the machine, or other situations analogous to that,
where the reboot is known to be unrelated to the success or failure
of the update.
[ 2] We could reasonably ship this enabled on workstation+desktop+laptop
environments with servers disabled until there's some less
wishy-washy description here. Despite what mattdm said above in
this thread, I think ultimately we do want it on server, even
though we care less about flicker-free booting there - the
countdown and probing aren't an insignificant chunk of the boot
time, and the time it takes to reboot can come to dominate
[ 3] See .
[ 4] As a for-instance, IBM ppc* machines nerf out the block device
write() call in their firmware, so we don't have one-shot variables
there at all and can't do any of this.
[ 5] I might be able to be convinced there's a case for local config
policy to be injected here, but I think the tool mentioned earlier
is probably enough.
Now you all get to tell me all the ways I'm wrong ;)
I am also opposed to relying on some boot failure indication to show
the menu, because failing storage media can prevent the variable from
being set.
Depending on the storage failure, it is not unreasonable to boot with
"ro init=/bin/sh" on the cmdline to get to some read-only environment to
begin recovering data, but that would become cumbersome by F30 if the
timeout is set to 0 and the machine boots via BIOS, where there are no
EFI variables to influence GRUB.