Re: Hiding the grub menu by default on single OS installs

Peter Jones <pjones@xxxxxxxxxx> · Fri, 1 Jun 2018 14:03:31 -0400

On Thu, May 31, 2018 at 05:47:36PM +0200, Hans de Goede wrote:
> Hi,
> 
> On 31-05-18 15:20, Robert Marcano wrote:
> > On 05/31/2018 06:52 AM, Hans de Goede wrote:
> > > ...
> > > This will basically get us back the F28 behavior of showing the
> > > menu but only after a failed boot, I think that is a good
> > > solution, do you agree?
> > 
> > What is the definition of a successful boot? I ask because a machine
> > could boot perfectly, and when you try to interact with it on the
> > login screen, bugs in the display driver can change the screen to
> > garbage (I have seen this kind of bug a long time ago), or lockup. So,
> > the user will be unable to activate any kind of restart with menu
> > enabled in order to try an older kernel, or boot to rescue mode.
> > 
> > I think instead of only detecting a successful boot, a machine that
> > wasn't properly shut down should enable the menu
> 
> A broken install may still shut down properly after the user presses
> the power button and/or tries ctrl+alt+del.
> 
> But this is an interesting suggestion, I think we should track both
> separately, so successful shutdown and successful boot and show the
> menu if either one is not true. That should make the chance of not
> being able to get the menu a lot smaller.

In my mind, the mechanism here looks like what I've sketched out below,
and I think it encapsulates the above as well as most of what I've seen
on this thread already.

The workflow is something like this:

- user updates the OS[0]
  - we automatically set the new OS to be booted /once/.
- we have a successful-boot-test.service that depends on [getty.target
  or graphical.target].  Upon starting, it sets a timer for some
  relatively long amount of time, like say 5 minutes, and at the end of
  that time it decides if booting worked and sets some state to let us
  know.
  - we also provide a tool for an admin to set a specific state, since
    they know best.
- if a user logs in and starts doing stuff before the timer expires,
  we booted successfully, and we set the new OS to be default and mark
  it as having succeeded.
- if the machine is rebooted *unexpectedly*[1] without any successful
  login before the timer expires, we reboot and get the previous OS, and
  we can detect that it failed during that boot and take whatever
  action is appropriate.
- if the timer expires without user activity, or if there's an
  expected intermediate reboot we need to do, it's indeterminate if it
  worked or not; we set the one-shot again[5].
  - in the case where it's an expected reboot, we reset the count of
    how many times we've reached the indeterminate state
  - otherwise we add one to the count
  - if the count is above some threshold (say 3) in some amount of time
    (say a day), set a one-shot variable that says to show the menu.
  - on server[2] we're going to want some indicator of "is successfully
    doing its job" instead of login; that's probably a separate
    feature.
  - It probably is worth having the power button be an indicator of how
    we shut down, and make that be a reason to show the menu, at least
    in some cases, if you haven't done things like gone into settings
    and told the power button to do nothing.
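
To make the bookkeeping above concrete, here is a rough Python sketch of
the three outcomes and the indeterminate counter.  It is only an
illustration under stated assumptions: the names (BootState,
classify_boot, record_boot), the in-memory state, and the exact values
of THRESHOLD and WINDOW_SECONDS are all hypothetical, and the real thing
would live in grubenv and a systemd service rather than in Python.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

THRESHOLD = 3             # "say 3" (assumed value)
WINDOW_SECONDS = 86400.0  # "say a day" (assumed value)


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    INDETERMINATE = "indeterminate"


@dataclass
class BootState:
    """Hypothetical stand-in for the state we would keep in grubenv."""
    default_os: str
    one_shot_os: Optional[str] = None
    indeterminate_times: List[float] = field(default_factory=list)
    show_menu_once: bool = False


def classify_boot(user_logged_in: bool, timer_expired: bool,
                  expected_reboot: bool) -> Outcome:
    """Map what happened during a boot onto the three outcomes."""
    if user_logged_in:
        return Outcome.SUCCESS
    if timer_expired or expected_reboot:
        return Outcome.INDETERMINATE
    # Rebooted unexpectedly, with no login, before the timer expired.
    return Outcome.FAILURE


def record_boot(state: BootState, new_os: str, outcome: Outcome,
                now: float, expected_reboot: bool = False) -> BootState:
    """Update the state after a boot of new_os ended with outcome."""
    if outcome is Outcome.SUCCESS:
        # The new OS becomes the default and no longer needs a one-shot.
        state.default_os = new_os
        state.one_shot_os = None
    elif outcome is Outcome.INDETERMINATE:
        # Set the one-shot again so the new OS gets another try.
        state.one_shot_os = new_os
        if expected_reboot:
            state.indeterminate_times.clear()
        else:
            state.indeterminate_times.append(now)
            # Only count indeterminate boots inside the time window.
            state.indeterminate_times = [
                t for t in state.indeterminate_times
                if now - t <= WINDOW_SECONDS
            ]
            if len(state.indeterminate_times) > THRESHOLD:
                state.show_menu_once = True
    else:
        # The one-shot was consumed, so the next boot gets default_os.
        state.one_shot_os = None
    return state
```

With these assumed values, the fourth indeterminate boot inside one day
is the one that sets the show-the-menu one-shot, and an expected reboot
(such as a relabel) clears the counter.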

And then concerning the actual menu+countdown (or more importantly, when
to probe for the keyboard), we don't show the menu or probe for key
state unless one of the following is the case:

- a persistent grub environment variable that says /not/ to show the
  menu is /absent/ or set to false.  (i.e. the user or some install
  class[3] disabled this feature, or if grubenv has been corrupted, or
  if we're on an architecture that insists on not having nice things[4],
  etc.)
- a one-shot grub environment variable, that says to show the menu, is
  set to true.  (i.e. user asked for the menu when they rebooted the
  machine)
- indeterminate boot count is > 1
- the previous boot is not marked as indeterminate or success
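
The four conditions above amount to a single predicate.  A minimal
Python sketch follows, assuming the grub environment has already been
parsed into a dict of strings.  The variable names (hide_menu,
show_menu_once, indeterminate_count, previous_boot) are made up for
illustration and are not existing grubenv keys, and treating an
unparsable count as a reason to show the menu is my own assumption.

```python
from typing import Dict


def should_show_menu(env: Dict[str, str]) -> bool:
    """Return True if the menu (and key probing) should be enabled.

    env is a hypothetical dict view of the grub environment block;
    every key name used here is an assumption for illustration.
    """
    # Persistent "do not show the menu" flag: absent or false means the
    # feature is off (or grubenv is corrupted), so show the menu.
    hide_menu = env.get("hide_menu", "false") == "true"
    if not hide_menu:
        return True

    # One-shot request, e.g. the user asked for the menu when rebooting.
    if env.get("show_menu_once", "false") == "true":
        return True

    # More than one indeterminate boot in a row.
    try:
        count = int(env.get("indeterminate_count", "0"))
    except ValueError:
        # Assumption: unreadable state is treated like a corrupted grubenv.
        return True
    if count > 1:
        return True

    # Previous boot is not marked as indeterminate or success.
    return env.get("previous_boot") not in ("indeterminate", "success")
```

For example, should_show_menu({}) is true (an empty or corrupted
grubenv falls back to the menu), while
should_show_menu({"hide_menu": "true", "previous_boot": "success"})
is false.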

[ 0] I'm being deliberately vague here because I think I mean "updates
     stuff that runs between (inclusively) the bootloader and
     [getty.target, graphical.target]" for the traditional OS, and not
     exactly the same criteria for Atomic, but both can reasonably be
     captured in one description.
[ 1] There are cases like if we do an SELinux relabel during boot and
     then reboot the machine, or other situations analogous to that,
     where the reboot is known to be unrelated to the success or failure
     of the update.
[ 2] We could reasonably ship this enabled on workstation+desktop+laptop
     environments with servers disabled until there's some less
     wishy-washy description here.  Despite what mattdm said above in
     this thread, I think ultimately we do want it on server, even
     though we care less about flicker-free booting there - the
     countdown and probing aren't an insignificant chunk of the boot
     time, and the time it takes to reboot can come to dominate
     downtime.
[ 3] See [2].
[ 4] As a for-instance, IBM ppc* machines nerf out the block device
     write() call in their firmware, so we don't have one-shot variables
     there at all and can't do any of this.
[ 5] I might be able to be convinced there's a case for local config
     policy to be injected here, but I think the tool mentioned earlier
     is probably enough.

Now you all get to tell me all the ways I'm wrong ;)

-- 
  Peter