5.18: likely useless very preliminary bug report: mdadm raid-6 boot-time assembly failure

Nix <nix@xxxxxxxxxxxxx> · Mon, 18 Jul 2022 13:20:16 +0100

So I have a pair of RAID-6 mdraid arrays on this machine (one of which
has a bcache layered on top of it, with an LVM VG stretched across
both). Kernel 5.16 + mdadm 4.0 (I know, it's old) works fine, but I just
rebooted into 5.18.12 and it failed to assemble. mdadm didn't display
anything useful: an mdadm --assemble --scan --auto=md --freeze-reshape
simply didn't find anything to assemble, and after that nothing else was
going to work. But rebooting into 5.16 worked fine, so everything was
(thank goodness) actually still there.

Alas, I can't say what the state of the blockdevs was (other than that
they all seemed to be in /dev, and I'm using DEVICE partitions so they
should all have been spotted) or anything else about the boot. Console
scrollback is still a nonexistent thing (as far as I can tell), the
output scrolls past too fast for me to video it, and I can't use
netconsole because this is the NFS and loghost server for the local
network, so all the other machines are more or less frozen waiting for
NFS to come back.

Any suggestions for getting more useful info out of this thing? I
suppose I could get a spare laptop and set it up to run as a netconsole
server for this one boot... but even that won't tell me what's going on
if the error (if any) is reported by some userspace process rather than
in the kernel message log.
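
As a rough sketch of what that one-boot netconsole setup could look
like: the snippet below only builds the netconsole= parameter string in
the format the kernel's netconsole documentation describes
(src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac). The helper name, the
addresses, the interface name and the MAC are all made-up placeholders,
not values from this machine.

```shell
#!/bin/sh
# Sketch only: build the netconsole= parameter string for a one-off boot.
# Every value passed in below is a made-up placeholder, not taken from the
# machine in this report.

# Usage: netconsole_param SRC_PORT SRC_IP DEV TGT_PORT TGT_IP TGT_MAC
# Prints: SRC_PORT@SRC_IP/DEV,TGT_PORT@TGT_IP/TGT_MAC
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Hypothetical layout: this server is 192.168.0.1 on eth0, the spare
# laptop is 192.168.0.2 and listens for UDP on port 6666.
param=$(netconsole_param 6665 192.168.0.1 eth0 6666 192.168.0.2 00:11:22:33:44:55)
echo "netconsole=$param"
```

The resulting string would go on the kernel command line (which needs
netconsole and the NIC driver built in) or be passed to modprobe
netconsole, with something like netcat listening for UDP on the laptop.
It would still only capture kernel messages, not userspace output.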

I'll run mdadm --examine on the component devices and look at
/proc/partitions next time I try booting (which won't be before this
weekend), but I'd be fairly
surprised if mdadm itself was at fault, even though it's the failing
component and it's old, unless the kernel upgrade has tripped some bug
in 4.0 -- or perhaps 4.0 built against a fairly old musl: I haven't even
recompiled it since 2019. So this looks like something in the blockdev
layer, which at this stage in booting is purely libata-based. (There is
an SSD on the machine, but it's used as a bcache cache device and for
XFS journals, both of which are at layers above mdadm so can't possibly
be involved in this.)
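
Since the console scrolls past too fast to read, one option is to dump
that state to a file rather than the screen. This is a minimal sketch,
assuming it runs from the early-boot script right after the failed
assemble. The function name, the MD_DEBUG_LOG variable and the default
path are hypothetical, and the output file has to live somewhere that
does not depend on the arrays themselves.

```shell
#!/bin/sh
# Sketch only: dump block-device and md state to a file so it can be read
# after the fact instead of off a scrolling console.

# Usage: collect_md_state OUTFILE
# OUTFILE must be on something that does not depend on the arrays.
collect_md_state() {
    {
        echo "=== uname -r ==="
        uname -r 2>&1 || echo "(uname failed)"
        echo "=== /proc/partitions ==="
        cat /proc/partitions 2>&1 || echo "(not readable)"
        echo "=== /proc/mdstat ==="
        cat /proc/mdstat 2>&1 || echo "(not readable)"
        echo "=== mdadm --examine --scan --verbose ==="
        mdadm --examine --scan --verbose 2>&1 || echo "(mdadm failed or missing)"
        echo "=== dmesg ==="
        dmesg 2>&1 || echo "(dmesg failed)"
    } > "$1"
}

# /tmp/md-debug.log is a hypothetical default; override with MD_DEBUG_LOG.
collect_md_state "${MD_DEBUG_LOG:-/tmp/md-debug.log}"
```

Each command's stderr is folded into the same file, so an error from
mdadm itself would be captured alongside the kernel log.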

-- 
NULL && (void)