Re: F34 Cloud Amazon AMIs unbootable after updates

Christopher <ctubbsii@xxxxxxxxxxxxxxxxx> · Fri, 8 Oct 2021 02:40:17 -0400

On Thu, Oct 7, 2021 at 5:48 PM Benjamin Herrenschmidt
<benh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 2021-10-06 at 15:41 -0500, Joe Doss wrote:
> >
> > > Does anybody know how to fix a currently broken instance and can
> > > share
> > > their solution?
> >
> > Is there anything on the console log when you reboot it after the
> > updates? If you can share the log that would be helpful.

I've seen two kinds of errors in the console log.

The first is the kernel panic shown in the screenshot in my earlier
email. Getting a larger console log dump in that case, when it works at
all, just shows more of the same. I think I can avoid that error by
avoiding certain instance types: m5 seems to work, but m5a and m6 do
not. I think this might be related to the new UEFI boot mode that EC2
now supports. Newer instance types default to UEFI mode unless
legacy-bios is specified. Our AMIs do not specify a boot mode, so they
use the instance type's default. m5 defaults to legacy-bios mode and
seems to work; I think m5a and m6 default to UEFI mode. We should start
specifying the boot mode explicitly in our AMI composes (and eventually
switch to UEFI mode, for a consistent experience in the cloud that
matches modern physical hardware that uses UEFI). I could be wrong, but
since this is a boot problem, it seems plausible that the boot mode is
what breaks those instance types.
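
A minimal sketch for checking the boot mode theory, assuming boto3 is
installed and AWS credentials are configured. It asks EC2 which boot
modes each instance type reports as supported. Note that this lists
supported modes only; it does not say which mode an instance type picks
when the AMI leaves the boot mode unset. The helper names are made up
for illustration.

```python
def boot_modes_by_type(response):
    """Map instance type -> sorted list of supported boot modes.

    `response` is a dict shaped like the result of EC2's
    DescribeInstanceTypes call.
    """
    return {
        item["InstanceType"]: sorted(item.get("SupportedBootModes", []))
        for item in response.get("InstanceTypes", [])
    }


def fetch_boot_modes(instance_types):
    """Query EC2 for the given instance types (needs boto3 + credentials)."""
    import boto3  # third-party; imported here so the parser above has no deps

    ec2 = boto3.client("ec2")
    response = ec2.describe_instance_types(InstanceTypes=list(instance_types))
    return boot_modes_by_type(response)
```

Calling fetch_boot_modes(["m5.large", "m5a.large"]) would be one way to
compare the types mentioned above. The boot mode of an AMI itself is
set when the image is registered (RegisterImage takes a BootMode
parameter), which is where an explicit legacy-bios or uefi would go.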

The second error is an endless scroll of a repeated message that
looked like:

  A start job is running for /dev/dis…f1a0b12a

This is most likely https://bugzilla.redhat.com/show_bug.cgi?id=2010058
and manually applying the patch there seems to work (even though I
wasn't sure I had applied it correctly).

>
> There's the new "ISC" (interactive serial console) that can help if you
> have grub timeout set to non-0...

I couldn't get that to work over ssh, but it did work in the web UI.
It wasn't very useful there, because the boot was stuck and wasn't
accepting input, and I couldn't catch it early enough for grub. I'm not
quite sure how to edit the grub2 config files to change the timeout. A
longer timeout would have let me select the previous kernel, to avoid
https://bugzilla.redhat.com/show_bug.cgi?id=2010058 , but it wouldn't
have helped with the kernel panic on m5a/m6 instance types.
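
For the timeout, a minimal sketch, assuming the setting lives in
/etc/default/grub as a GRUB_TIMEOUT= line (the usual place on Fedora).
The function name is hypothetical; it only rewrites the text and leaves
reading and writing the file to the caller.

```python
import re


def set_grub_timeout(config_text, seconds):
    """Return config_text with GRUB_TIMEOUT set to `seconds`.

    Replaces an existing uncommented GRUB_TIMEOUT= line, or appends
    one if none is present. GRUB_TIMEOUT_STYLE= lines are left alone.
    """
    new_line = f"GRUB_TIMEOUT={int(seconds)}"
    pattern = re.compile(r"^GRUB_TIMEOUT=.*$", flags=re.MULTILINE)
    if pattern.search(config_text):
        return pattern.sub(new_line, config_text)
    # No existing setting: append, making sure it starts on its own line.
    sep = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return f"{config_text}{sep}{new_line}\n"
```

After writing the result back, the generated config still has to be
rebuilt, e.g. with grub2-mkconfig -o /boot/grub2/grub.cfg from inside
the chroot. Whether that path is right for a given image is an
assumption worth checking first.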

>
> Otherwise, you can detach the EBS volume, attach it to another
> instance, mount & fixup, then back the other way around (the magic to
> re-attach the root device is to call it "xvda" without number).

That's what I ended up doing (but I always re-attach as /dev/sda1, as
it was before). I've done this plenty of times before. The main
problem this time was that there were no logs to investigate, and no
clues about what to change to fix the issue.
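
The detach/attach dance can be sketched roughly as follows, assuming a
boto3 EC2 client is passed in and the broken instance has already been
stopped. The function name and argument names are made up for
illustration; the client methods and waiter names are the standard
boto3 ones.

```python
def move_volume(ec2, volume_id, from_instance, to_instance, device):
    """Detach volume_id from one instance and attach it to another.

    `ec2` is expected to behave like boto3.client("ec2"). The source
    instance should be stopped before its root volume is detached.
    """
    ec2.detach_volume(VolumeId=volume_id, InstanceId=from_instance)
    # Block until the volume is free to be attached elsewhere.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(VolumeId=volume_id, InstanceId=to_instance, Device=device)
    # Block until the attachment has completed.
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
```

Call it once to move the volume to the rescue instance under a spare
device name, mount and fix things up, then call it again in the other
direction with the original root device name (/dev/sda1 in my case).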

I did attempt to chroot and do a DNF rollback. However, it seems the
DNF history command is buggy: it crashes with a message mentioning
"Reason Change". This appears to be a new kind of entry in DNF
transactions, and the DNF history command doesn't know how to handle
it for rollbacks and undos. I used more primitive tools to change
packages one at a time, eventually reinstalling the kernel after
applying the fix in
https://bugzilla.redhat.com/show_bug.cgi?id=2010058

>
> Cheers,
> Ben.

Thanks all for the tips and suggestions. Persistence paid off... eventually.