On Thu, Oct 7, 2021 at 5:48 PM Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 2021-10-06 at 15:41 -0500, Joe Doss wrote:
> > >
> > > Does anybody know how to fix a currently broken instance and can
> > > share their solution?
> >
> > Is there anything on the console log when you reboot it after the
> > updates? If you can share the log that would be helpful.

I've seen two kinds of errors in the console log.

The first is the kernel panic in the screenshot in my earlier email.
Getting a larger console log dump in that case, when it works at all,
just gets more of the same. I think I can avoid that error by avoiding
certain instance types. m5 seems to work, but m5a and m6 do not.

I think this might be related to the new UEFI boot type that EC2 now
supports. On newer instance types, they default to UEFI mode unless
legacy-bios is specified. Our AMIs do not specify a boot type, so they
would use the instance type's default. m5 defaults to legacy-bios mode,
and seems to work. I think m5a and m6 default to UEFI mode. We should
start specifying our boot mode explicitly in our AMI composes (and
eventually switch to UEFI mode, for a consistent experience in the
cloud that matches modern physical hardware that uses UEFI). In any
case, I could be incorrect, but since this is a boot problem, it seems
possible the boot mode feature is the problem in those instance types.

The second error is just an infinite scroll of a repeated error message
that looked like:

  A start job is running for /dev/dis…f1a0b12a

This is most likely https://bugzilla.redhat.com/show_bug.cgi?id=2010058
and manually applying the patch there seems to work (even though I
wasn't sure if I did it correctly).

>
> There's the new "ISC" (interactive serial console) that can help if you
> have grub timeout set to non-0...

I couldn't get that to work over ssh, but it did work on the web UI. It
wasn't very useful there, because the boot was stuck and wasn't
accepting input.
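
For the non-zero grub timeout mentioned in the quoted suggestion, here
is a minimal sketch of one way to set it. It assumes the usual Fedora
layout, where the timeout is the GRUB_TIMEOUT variable in
/etc/default/grub and takes effect only after grub.cfg is regenerated
(typically with grub2-mkconfig -o /boot/grub2/grub.cfg, run on the
instance or in a chroot of the rescued volume). Neither assumption has
been checked against these AMIs, and the helper name set_grub_timeout
is made up for illustration.

```python
import re


def set_grub_timeout(config_text: str, seconds: int) -> str:
    """Return config_text with GRUB_TIMEOUT set to `seconds`.

    Hypothetical helper: it only edits text in the style of
    /etc/default/grub. It does not write the file or regenerate
    grub.cfg.
    """
    new_line = f"GRUB_TIMEOUT={seconds}"
    # Match the whole "GRUB_TIMEOUT=..." line. GRUB_TIMEOUT_STYLE=
    # and commented-out lines are deliberately left alone.
    pattern = re.compile(r"^[ \t]*GRUB_TIMEOUT=.*$", re.MULTILINE)
    if pattern.search(config_text):
        return pattern.sub(new_line, config_text)
    # No existing setting: append one, keeping a trailing newline.
    sep = "" if config_text.endswith("\n") or not config_text else "\n"
    return f"{config_text}{sep}{new_line}\n"


if __name__ == "__main__":
    # Sample content is illustrative, not taken from a real image.
    sample = "GRUB_TIMEOUT=0\nGRUB_TIMEOUT_STYLE=menu\n"
    print(set_grub_timeout(sample, 5), end="")
```

The function works on a string rather than the file itself, so the
result can be reviewed before anything under /etc is overwritten.
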
I couldn't catch it early enough for grub. I'm not quite certain how to
edit the grub2 config files to change the timeout. This would have
helped me select the previous kernel, to avoid
https://bugzilla.redhat.com/show_bug.cgi?id=2010058 , but it wouldn't
have helped with the kernel panic on m5a/m6 instance types.

>
> Otherwise, you can detach the EBS volume, attach it to another
> instance, mount & fixup, then back the other way around (the magic to
> re-attach the root device is to call it "xvda" without number).

That's what I ended up doing (but I always re-attach as /dev/sda1, how
it was before). I've done this plenty of times before. The main problem
this time was that there were no logs to investigate, and no clues
about what to change to fix the issue.

I did attempt to chroot and do a DNF rollback. However, it seems the
DNF history command is buggy, and crashes with a message mentioning
"Reason Change". This appears to be a new thing in DNF transactions,
and the DNF history command doesn't know how to handle it for rollbacks
and undos. I used more primitive tools to change packages one at a
time, eventually reinstalling the kernel after applying the fix in
https://bugzilla.redhat.com/show_bug.cgi?id=2010058

>
> Cheers,
> Ben.

Thanks all for the tips and suggestions. Persistence paid off...
eventually.
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure