On 18 Jul 2022, Roger Heflin told this:

Oh I was hoping you'd weigh in :)

> Did it drop you into the dracut shell (since you do not have
> scroll-back this seem the case? or the did the machine fully boot up
> and simply not find the arrays?

Well, I'm not using dracut, which is a horrifically complex
failure-prone nightmare as you note: I wrote my own (very simple) early
init script, which failed as it is designed to do when mdadm doesn't
assemble any arrays and dropped me into an emergency ash shell
(statically linked against musl). As a result I can be absolutely
certain that nothing has changed or been rebuilt in early init since I
built my last working kernel. (It's all assembled into an initramfs by
the in-kernel automated assembly stuff under the usr/ subdirectory in
the kernel source tree.)

Having the initramfs linked into the kernel is *such a good thing* in
situations like this: I can be absolutely certain that as long as the
data on disk is not fubared nothing can possibly have messed up the
initramfs or early boot in general after the kernel is linked, because
nothing can change it at all. :)

(I just diffed both initramfses from the working and non-working
kernels: the one in the running kernel I keep mounted under
/run/initramfs after boot is over because it also gets used during late
shutdown, and the one from the new broken one is still in the cpio
archive in the build tree, so this was easy. They're absolutely
identical.)

> If it dropped you to the dracut shell, add nofail on the fstab
> filesystem entries for the raids so it will let you boot up and debug.

Well... the rootfs is *on* the raid, so that's not going to work.
(Actually, it's under a raid -> bcache -> lvm stack, with one of two
raid arrays bcached and the lvm stretching across both of them. If only
the raid had come up, I have a rescue fs on the non-bcached half of the
lvm so I could have assembled it in degraded mode and booted from that.
But it didn't, so I couldn't.)

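An initramfs comparison like the one described above can be sketched
roughly as follows. This is only an illustration, not the exact
procedure used: the function names and directory arguments are made up,
and it assumes the archive in the build tree is an uncompressed cpio.
Note that diff -r compares names and file contents only, not ownership
or modes.

```shell
#!/bin/sh
# Hypothetical helper: unpack a cpio archive into a directory.
# The redirect is applied before the cd, so a relative archive path works.
unpack_cpio() {
    archive=$1
    dest=$2
    mkdir -p "$dest" && (cd "$dest" && cpio -id) <"$archive"
}

# Hypothetical helper: succeed only if two unpacked trees match.
# diff -r exits non-zero if any file differs or exists on one side only.
initramfs_trees_identical() {
    old_tree=$1
    new_tree=$2
    diff -r "$old_tree" "$new_tree" >/dev/null 2>&1
}
```

Usage would be something like unpacking the new archive into a scratch
directory and then running
initramfs_trees_identical /run/initramfs /tmp/new-initramfs.
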
Assembly does this:

    /sbin/mdadm --assemble --scan --auto=md --freeze-reshape

The initramfs includes this mdadm.conf:

    DEVICE partitions
    ARRAY /dev/md/transient UUID=28f4c81c:f44742ea:89d4df21:6aea852b
    ARRAY /dev/md/slow UUID=a35c9c54:bcdbff37:4f18163e:a93e9aa2
    ARRAY /dev/md/fast UUID=4eb6bf4e:7458f1f1:d05bdfe4:6d38ca23
    MAILADDR postmaster@xxxxxxxxxxxxx

(which is again identical on the working kernels.)

> If it is inside the dracut shell it sounds like something in the
> initramfs might be missing.

Definitely not, thank goodness.

> I have seen a dracut (re)build "fail" to
> determining that a device driver is required and not include it in the
> initramfs, and/or have seen the driver name change and the new kernel

Oh yes: this is one of many horrible things about dracut. It assumes
the new kernel is similar enough to the running one that it can use the
one to configure the other. This is extremely not true when (as this
machine is) it's building kernels for *other* machines in containers :)
but the thing that failed was the machine's native kernel.

(It is almost non-modular. I only have fat and vfat modules loaded
right now: everything else is built in.)

> I have also seen newer versions of software stacks/kernels
> create/ignore underlying partitions that worked on older versions (ie
> a partition on a device that has the data also--sometimes a
> partitioned partition).

I think I can be sure that only the kernel itself has changed here. I
wonder if make oldconfig messed up and I lost libata or something? ...
no, it's there.

> Newer version have suddenly saw that /dev/sda1 was partitioned and
> created a /dev/sda1p1 and "hidden" sda1 from scanning causing LVM not
> not find pvs.
>
> I have also seen where the in-use/data device was /dev/sda1p1 and an
> update broke partitioning a partition so only showed /dev/sda1, and so
> no longer sees the devices.

Ugh. That would break things indeed! This is all atop GPT...

    CONFIG_EFI_PARTITION=y

no it's still there.

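Checking options one at a time is error-prone, so the two .config files
could also be compared mechanically to see whether make oldconfig
dropped anything. A minimal sketch, assuming both configs are available
as plain files; the function name lost_options and the paths in the
usage line are hypothetical.

```shell
#!/bin/sh
# Print every option that is =y or =m in the old config but is no
# longer =y or =m in the new one. A y -> m (or m -> y) change is not
# reported, only options that disappeared or became "is not set".
lost_options() {
    old_config=$1
    new_config=$2
    grep -E '^CONFIG_[A-Za-z0-9_]+=(y|m)$' "$old_config" |
    while IFS= read -r line; do
        opt=${line%%=*}    # strip "=y" / "=m", leaving the option name
        grep -q "^${opt}=[ym]\$" "$new_config" || printf '%s\n' "$opt"
    done
}
```

Usage would be along the lines of

    lost_options /path/to/old/.config /path/to/new/.config

with empty output meaning nothing that was enabled has gone missing.
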
I doubt anyone can conclude anything until I collect more info. This bug report is mostly useless for a reason!