Re: Hiding the grub menu by default on single OS installs

Hi,

Note I've dropped the fedora-devel list (-ETOOMUCHBIKESHED)
and added Javier and Jan to the Cc.

On 01-06-18 20:03, Peter Jones wrote:
On Thu, May 31, 2018 at 05:47:36PM +0200, Hans de Goede wrote:
Hi,

On 31-05-18 15:20, Robert Marcano wrote:
On 05/31/2018 06:52 AM, Hans de Goede wrote:
...
This will basically get us back the F28 behavior of showing the
menu but only after a failed boot, I think that is a good
solution, do you agree?

What is the definition of a successful boot? I ask because a machine
could boot perfectly, and when you try to interact with it on the
login screen, bugs in the display driver can change the screen to
garbage (I have seen this kind of bug a long time ago), or lock up. So,
the user will be unable to activate any kind of restart with the menu
enabled in order to try an older kernel, or boot to rescue mode.

I think instead of only detecting a successful boot, a machine that
wasn't properly shut down should enable the menu.

A broken install may still shut down properly after the user presses
the power button and/or tries ctrl+alt+del.

But this is an interesting suggestion, I think we should track both
separately, so successful shutdown and successful boot and show the
menu if either one is not true. That should make the chance of not
being able to get the menu a lot smaller.
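
A minimal sketch of that rule, assuming the two outcomes are recorded
somewhere as plain booleans (the function and parameter names are
hypothetical, not part of any existing tool):

```python
def show_menu_after_failure(boot_succeeded: bool, shutdown_succeeded: bool) -> bool:
    """Return True when the menu should be shown on the next boot.

    The menu stays hidden only if BOTH the previous boot and the
    previous shutdown were recorded as successful.
    """
    return not (boot_succeeded and shutdown_succeeded)
```

With this, a machine that booted fine but was then forced off still
gets the menu on the next boot.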

In my mind, the mechanism here looks like what I've sketched out below,
and I think it encapsulates the above as well as most of what I've seen
on this thread already.

The workflow is something like this:

- user updates the OS[0]
   - we automatically set the new OS to be booted /once/.

Hmm, I see you also refer to atomic and there this makes sense, but
in the traditional distro model how would we implement this?

We could implement booting a new kernel once, but since an xserver /
mesa / gnome update might break things just as easily as a kernel
update can, I'm not sure if adding boot-once functionality
to the traditional model is really helpful.

Reverting to the old kernel might help in some cases, but we are
also going to get false-positives. I've a feeling this is going to
become really messy. As such I don't think this is a change we
can "sell" easily. Some people really don't seem to like the idea of
any changes to the grub config / menu at all.

I've a feeling that selling the hidden menu by itself is enough
of a hassle without adding in booting a new kernel once to test it.
I realize that this is, in a way, a means of lessening the impact of
the menu being hidden, but I'm not 100% sold on this.

I would rather just show the menu after a failed boot and have
reverting to the old kernel be a conscious choice of the user. I have
a number of reasons for this:

1) Don't revert to older kernel on false-positive failed boot detects
   (limit the result of a false-positive failed boot detect to showing
    the menu without any side effects)

2) Updates typically come in batches and the boot failure may well be
   caused by something else, so we're not necessarily helping the user
   here. Even if the user manages to fix things, they will now be
   running an older kernel for no good reason.

3) Since reverting to the old kernel may not be enough, we still need
   to show the menu after a failed boot

4) Principle of least surprise: we are now making unrequested changes to
   the user's system and not (really) notifying the user of this.
   For Atomic I envision that after switching back to the old snapshot /
   release the UI will show a dialog after login along the lines of:
   "The new 20190214 release did not work, we've reverted your machine
    to the 20190207 release" (but then better worded). We could do
   something similar for the kernel, assuming reverting to the old
   kernel will allow us to show the dialog, but we again have the whole
   false positive thing, so now we end up showing a scary dialog because
   of a false-positive failed-boot detect.

So all in all I'm not a big fan of the boot once concept for the
traditional Fedora version. I think it makes a lot of sense for Atomic
and we should do it there, but not for Fedora.

Another thing to keep in mind is that we don't really have much time
to get things in place for F29, so especially for F29 this seems
too complex and I would prefer to only add a "GRUB_AUTO_HIDE"
option to /etc/default/grub which, when set, will make grub2-mkconfig
generate a grub.cfg which hides the menu unless a failed boot
is detected, and which does not make any changes wrt which kernel to
boot when.

This also has the added advantage that it avoids me touching the
default selection code, which would collide with Javier's BLS work I think.
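
To make the proposed option concrete, here is a rough sketch of how a
tool might read it. It assumes GRUB_AUTO_HIDE takes the values
"true" / "false" (the mail does not specify the value format) and that
/etc/default/grub contains simple shell-style KEY=value lines; the
helper names are hypothetical and this is not how grub2-mkconfig itself
is implemented:

```python
import shlex


def parse_default_grub(text: str) -> dict:
    """Parse simple KEY=value lines in the style of /etc/default/grub."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex.split removes shell-style quoting around the value.
        parts = shlex.split(raw)
        settings[key.strip()] = parts[0] if parts else ""
    return settings


def menu_hidden(settings: dict, failed_boot_detected: bool) -> bool:
    """Hide the menu only if auto-hide is on and no failed boot was seen.

    Assumption: the option is enabled by the literal value "true".
    """
    auto_hide = settings.get("GRUB_AUTO_HIDE", "false").lower() == "true"
    return auto_hide and not failed_boot_detected
```

Note that nothing here touches which kernel is the default; the option
only affects whether the menu is visible.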

Regards,

Hans



- we have a successful-boot-test.service that depends on [getty.target
   or graphical.target].  Upon starting, it sets a timer for some
   relatively long amount of time, like say 5 minutes, and at the end of
   that time it decides if booting worked and sets some state to let us
   know.
   - we also provide a tool for an admin to set a specific state, since
     they know best.
- if a user logs in and starts doing stuff before the timer expires,
   we booted successfully, and we set the new OS to be default and mark
   it as having succeeded.
- if the machine is rebooted *unexpectedly*[1] without any successful
   login before the timer expires, we reboot and get the previous OS, and
   we can detect that it failed during that boot and take whatever
   appropriate action
- if the timer expires without user activity, or if there's an
   expected intermediate reboot we need to do, it's indeterminate if it
   worked or not; we set the one-shot again[5].
   - in the case where it's an expected reboot, we re-set the count of
     how many times we've reached the indeterminate state
   - otherwise we add one to the count
   - if the count is above some threshold (say 3) in some amount of time
     (say a day), set a one-shot variable that says to show the menu.
   - on server[2] we're going to want some indicator of "is successfully
     doing its job" instead of login; that's probably a separate
     feature.
   - It probably is worth having the power button be an indicator of how
     we shut down, and make that be a reason to show the menu, at least
     in some cases, if you haven't done things like gone into settings
     and told the power button to do nothing.
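
The classification and counting steps above could be sketched roughly
as follows. This is only an illustration under stated assumptions: the
names are hypothetical, the threshold of 3 and the one-day window are
the "say 3" / "say a day" values from the list, and real state would
live in grubenv rather than in a Python object:

```python
from dataclasses import dataclass, field
from enum import Enum

THRESHOLD = 3                  # "say 3"
WINDOW_SECONDS = 24 * 60 * 60  # "say a day"


class Outcome(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    INDETERMINATE = "indeterminate"


def classify_boot(login_before_timer: bool, unexpected_reboot: bool) -> Outcome:
    """Classify one boot according to the workflow in the list above."""
    if login_before_timer:
        return Outcome.SUCCESS
    if unexpected_reboot:
        return Outcome.FAILED
    # Timer expired without activity, or an expected intermediate reboot.
    return Outcome.INDETERMINATE


@dataclass
class IndeterminateTracker:
    timestamps: list = field(default_factory=list)
    show_menu_once: bool = False

    def record(self, now: float, expected_reboot: bool) -> None:
        """Record one indeterminate boot at time `now` (in seconds)."""
        if expected_reboot:
            # An expected reboot re-sets the count.
            self.timestamps.clear()
            return
        self.timestamps.append(now)
        # Only count indeterminate boots that fall inside the window.
        self.timestamps = [t for t in self.timestamps
                           if now - t <= WINDOW_SECONDS]
        if len(self.timestamps) > THRESHOLD:
            self.show_menu_once = True
```

Here "above some threshold" is read as strictly greater than, so the
fourth indeterminate boot within a day sets the one-shot flag.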

And then concerning the actual menu+countdown (or more importantly, when
to probe for the keyboard), we don't show the menu or probe for key
state unless one of the following is the case:

- a persistent grub environment variable that says /not/ to show the
   menu is /absent/ or set to false.  (i.e. the user or some install
   class[3] disabled this feature, or if grubenv has been corrupted, or
   if we're on an architecture that insists on not having nice things[4],
   etc.)
- a one-shot grub environment variable, that says to show the menu, is
   set to true.  (i.e. user asked for the menu when they rebooted the
   machine)
- indeterminate boot count is > 1
- the previous boot is not marked as indeterminate or success
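
Put together, the four conditions above amount to a single predicate.
A minimal sketch, assuming the state has already been read out of
grubenv into plain Python values (all parameter names and the string
values for the previous boot state are hypothetical):

```python
from typing import Optional


def should_show_menu(hide_menu_flag: Optional[bool],
                     show_menu_once: bool,
                     indeterminate_count: int,
                     previous_boot: Optional[str]) -> bool:
    """Return True if the menu (and keyboard probing) should be enabled.

    hide_menu_flag is None when the persistent variable is absent,
    e.g. because grubenv is corrupted or cannot be written.
    """
    if not hide_menu_flag:
        # Persistent "do not show the menu" variable absent or false.
        return True
    if show_menu_once:
        # One-shot request, e.g. the user asked for the menu on reboot.
        return True
    if indeterminate_count > 1:
        return True
    if previous_boot not in ("indeterminate", "success"):
        return True
    return False
```

The default is deliberately "show the menu": every missing or
unexpected piece of state falls through to True.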

[ 0] I'm being deliberately vague here because I think I mean "updates
      stuff that runs between (inclusively) the bootloader and
      [getty.target, graphical.target]" for the traditional OS, and not
      exactly the same criteria for Atomic, but both can reasonably be
      captured in one description.
[ 1] There are cases like if we do an SELinux relabel during boot and
      then reboot the machine, or other situations analogous to that,
      where the reboot is known to be unrelated to the success or failure
      of the update.
[ 2] We could reasonably ship this enabled on workstation+desktop+laptop
      environments with servers disabled until there's some less
      wishy-washy description here.  Despite what mattdm said above in
      this thread, I think ultimately we do want it on server, even
      though we care less about flicker-free booting there - the
      countdown and probing aren't an insignificant chunk of the boot
      time, and the time it takes to reboot can come to dominate
      downtime.
[ 3] See [2].
[ 4] As a for-instance, IBM ppc* machines nerf out the block device
      write() call in their firmware, so we don't have one-shot variables
      there at all and can't do any of this.
[ 5] I might be able to be convinced there's a case for local config
      policy to be injected here, but I think the tool mentioned earlier
      is probably enough.

Now you all get to tell me all the ways I'm wrong ;)

_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx/message/LLPND7NHDKBBJ5E34Q3MA5BRUQAPODP6/
