On Thu, May 31, 2018 at 05:47:36PM +0200, Hans de Goede wrote:
> Hi,
>
> On 31-05-18 15:20, Robert Marcano wrote:
> > On 05/31/2018 06:52 AM, Hans de Goede wrote:
> > > ...
> > > This will basically get us back the F28 behavior of showing the
> > > menu but only after a failed boot, I think that is a good
> > > solution, do you agree?
> >
> > What is the definition of a successful boot? I ask because a machine
> > could boot perfectly, and when you try to interact with it on the
> > login screen, bugs on the display driver can change the screen to
> > garbage (I have seen this kind of bug a long time ago), or lockup. So,
> > the user will be unable to activate any kind of restart with menu
> > enabled in order to try an older kernel, or boot to rescue mode.
> >
> > I think instead of only detecting a successful boot, a machine that
> > wasn't properly shutdown should enable the menu
>
> A broken install may still shutdown properly after the user pressing
> the power-button and/or trying ctrl+alt+del.
>
> But this is an interesting suggestion, I think we should track both
> separately, so successful shutdown and successful boot and show the
> menu if either one is not true. That should make the chance of not
> being able to get the menu a lot smaller.

In my mind, the mechanism here looks like what I've sketched out below,
and I think it encapsulates the above as well as most of what I've seen
on this thread already. The workflow is something like this:

- user updates the OS[0]
- we automatically set the new OS to be booted /once/.
- we have a successful-boot-test.service that depends on
  [getty.target or graphical.target]. Upon starting, it sets a timer
  for some relatively long amount of time, like say 5 minutes, and at
  the end of that time it decides if booting worked and sets some state
  to let us know.
- we also provide a tool for an admin to set a specific state, since
  they know best.
- if a user logs in and starts doing stuff before the timer expires, we
  booted successfully, and we set the new OS to be default and mark it
  as having succeeded.
- if the machine is rebooted *unexpectedly*[1] without any successful
  login before the timer expires, we reboot and get the previous OS,
  and we can detect that it failed during that boot and take whatever
  action is appropriate.
- if the timer expires without user activity, or if there's an expected
  intermediate reboot we need to do, it's indeterminate if it worked or
  not; we set the one-shot again[5].
  - in the case where it's an expected reboot, we re-set the count of
    how many times we've reached the indeterminate state
  - otherwise we add one to the count
  - if the count is above some threshold (say 3) in some amount of time
    (say a day), set a one-shot variable that says to show the menu.
- on server[2] we're going to want some indicator of "is successfully
  doing its job" instead of login; that's probably a separate feature.
- It probably is worth having the power button be an indicator of how
  we shut down, and make that be a reason to show the menu, at least in
  some cases, if you haven't done things like gone into settings and
  told the power button to do nothing.

And then concerning the actual menu+countdown (or more importantly, when
to probe for the keyboard), we don't show the menu or probe for key
state unless one of the following is the case:

- a persistent grub environment variable that says /not/ to show the
  menu is /absent/ or set to false. (i.e. the user or some install
  class[3] disabled this feature, or if grubenv has been corrupted, or
  if we're on an architecture that insists on not having nice
  things[4], etc.)
- a one-shot grub environment variable, that says to show the menu, is
  set to true. (i.e. user asked for the menu when they rebooted the
  machine)
- indeterminate boot count is > 1
- the previous boot is not marked as indeterminate or success

[0] I'm being deliberately vague here because I think I mean "updates
    stuff that runs between (inclusively) the bootloader and
    [getty.target, graphical.target]" for the traditional OS, and not
    exactly the same criteria for Atomic, but both can reasonably be
    captured in one description.
[1] There are cases, such as an SELinux relabel during boot followed by
    a reboot, or other situations analogous to that, where the reboot
    is known to be unrelated to the success or failure of the update.
[2] We could reasonably ship this enabled on workstation+desktop+laptop
    environments with servers disabled until there's some less
    wishy-washy description here. Despite what mattdm said above in
    this thread, I think ultimately we do want it on server, even
    though we care less about flicker-free booting there - the
    countdown and probing aren't an insignificant chunk of the boot
    time, and the time it takes to reboot can come to dominate
    downtime.
[3] See [2].
[4] As a for-instance, IBM ppc* machines nerf out the block device
    write() call in their firmware, so we don't have one-shot variables
    there at all and can't do any of this.
[5] I might be able to be convinced there's a case for local config
    policy to be injected here, but I think the tool mentioned earlier
    is probably enough.

Now you all get to tell me all the ways I'm wrong ;)

-- 
        Peter
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx/message/M2XVBWGDULWYQYCTMTQ3J5VNOS7XB5NO/
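
For anyone who finds the bookkeeping in the message above easier to
follow as code, here is a rough Python sketch of the boot-outcome
tracking and the "show the menu?" decision. It is only an illustration
under assumptions: every name in it (BootState, record_boot,
should_show_menu, and all of the fields) is made up for this sketch and
is not a real grubenv variable or tool, the "say a day" time window on
the count is left out, and the message does not say whether a
successful boot resets the count, so this sketch leaves it alone.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    INDETERMINATE = "indeterminate"
    FAILED = "failed"


@dataclass
class BootState:
    # Hypothetical stand-ins for the persistent and one-shot variables.
    hide_menu: Optional[bool] = None   # None models an absent/corrupt grubenv
    show_menu_once: bool = False       # one-shot "show the menu"
    boot_new_os_once: bool = False     # one-shot "boot the new OS"
    new_os_is_default: bool = False
    indeterminate_count: int = 0
    previous_outcome: Optional[Outcome] = None


def record_boot(state, *, logged_in, timer_expired, expected_reboot,
                threshold=3):
    """Record how a boot of the new OS ended (threshold is the 'say 3')."""
    if logged_in:
        # User was active before the timer ran out: the boot worked.
        state.previous_outcome = Outcome.SUCCESS
        state.new_os_is_default = True
        state.boot_new_os_once = False
    elif timer_expired or expected_reboot:
        # Can't tell either way: try the new OS once more.
        state.previous_outcome = Outcome.INDETERMINATE
        state.boot_new_os_once = True
        if expected_reboot:
            state.indeterminate_count = 0
        else:
            state.indeterminate_count += 1
        if state.indeterminate_count > threshold:
            state.show_menu_once = True
    else:
        # Unexpected reboot, no login, timer never expired: the one-shot
        # is spent, so the previous OS boots next.
        state.previous_outcome = Outcome.FAILED
        state.boot_new_os_once = False
    return state


def should_show_menu(state):
    """True if any of the four listed conditions holds."""
    return (
        not state.hide_menu                 # absent (None) or set to false
        or state.show_menu_once
        or state.indeterminate_count > 1
        or state.previous_outcome
        not in (Outcome.INDETERMINATE, Outcome.SUCCESS)
    )
```

Note that with an empty BootState() the menu is shown, which matches
the "absent or corrupted grubenv" case: the quiet path is only taken
when the state is present and says the last boot was fine.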