Re: 5.18: likely useless very preliminary bug report: mdadm raid-6 boot-time assembly failure

On 22 Jul 2022, Roger Heflin verbalised:

> On Fri, Jul 22, 2022 at 5:11 AM Nix <nix@xxxxxxxxxxxxx> wrote:
>>
>> On 20 Jul 2022, Wols Lists outgrape:
>>
>> > On 20/07/2022 16:55, Nix wrote:
>> >> [    9.833720] md: md126 stopped.
>> >> [    9.847327] md/raid:md126: device sda4 operational as raid disk 0
>> >> [    9.857837] md/raid:md126: device sdf4 operational as raid disk 4
>> >> [    9.868167] md/raid:md126: device sdd4 operational as raid disk 3
>> >> [    9.878245] md/raid:md126: device sdc4 operational as raid disk 2
>> >> [    9.887941] md/raid:md126: device sdb4 operational as raid disk 1
>> >> [    9.897551] md/raid:md126: raid level 6 active with 5 out of 5 devices, algorithm 2
>> >> [    9.925899] md126: detected capacity change from 0 to 14520041472
>> >
>> > Hmm.
>> >
>> > Most of that looks perfectly normal to me. The only oddity, to my eyes, is that md126 is stopped before the disks become
>> > operational. That could be perfectly okay, it could be down to a bug, whatever whatever.
>>
>> Yeah this is the *working* boot. I can't easily get logs of the
>> non-working one because, well, no writable filesystems and most of the
>> interesting stuff scrolls straight off the screen anyway. (It's mostly
>> for comparison with the non-working boot once I manage to capture that.
>> Somehow. A high-speed camera on video mode and hand-transcribing? Uggh.)
>
> if you find the partitions missing, and your initrd has kpartx on it,
> that will create the mappings:
>
>   kpartx -av <device>

I may have to fall back to that, but the system is supposed to be doing
this for me dammit! :)
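
(If I do end up doing that by hand from the initramfs shell, I assume it
would look something like this, repeated per disk -- device name here is
just an example:)

  kpartx -av /dev/sda    # create /dev/mapper entries for sda's partitions
  ls /dev/mapper/        # sda1, sda2, ... should show up here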

The initrd is using busybox 1.30.1's mdev and mdadm 4.0, both linked
against musl -- if this has suddenly broken, I suspect a lot of udev
setups have broken similarly. But both are old, upgraded only when
essential to avoid breaking stuff critical for boot (hah!): upgrading
them is on the cards, to make sure it's not something already fixed in
the userspace tools...
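
(The assembly path in there is nothing exotic, just mdev handing new
block devices to mdadm's incremental assembly -- roughly the below,
though I'm quoting from memory and the exact rules and paths may not
match what's actually in the image:)

  # /etc/mdev.conf: run incremental assembly when a disk/partition appears
  sd[a-z].*    0:0 660    @/sbin/mdadm -I /dev/$MDEV

  # fallback in the init script, in case hotplug missed anything:
  /sbin/mdadm --assemble --scan --run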

(Not been rebooting because of lots of time away from home: now not
rebooting because I've got probable flu and can't face it. But once
that's over, I'll attack this.)
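
(One alternative to the high-speed camera, if the failing kernel has
netconsole built in and the NIC comes up early enough: throw the kernel
log at another box over UDP. Addresses and MAC below are placeholders,
not my real network:)

  # on the failing machine's kernel command line:
  loglevel=7 netconsole=6665@192.168.0.2/eth0,6666@192.168.0.1/00:11:22:33:44:55

  # on the receiving box (netcat option spelling varies by variant):
  nc -u -l -p 6666 | tee failing-boot.log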

> I wonder if it is some sort of module loading order issue and/or
> built-in vs module for one or more of the critical drivers in the
> chain.

Definitely not! This kernel is almost totally non-modular:

compiler@loom 126 /usr/src/boost% cat /proc/modules 
vfat 20480 1 - Live 0xffffffffc0176000
fat 73728 1 vfat, Live 0xffffffffc015c000

That's *it* for the currently loaded modules (those are probably loaded
because I built a test kernel and had to mount the EFI boot fs to
install it, which is not needed during normal boots because the
initramfs is linked into the kernel image).
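
(Double-checking that the RAID and controller drivers really are built
in is easy enough -- something like this, assuming modules.builtin got
installed and/or IKCONFIG is enabled in this kernel:)

  grep -i -e raid -e md-mod /lib/modules/$(uname -r)/modules.builtin
  zgrep -E 'CONFIG_BLK_DEV_MD|CONFIG_MD_RAID456|CONFIG_ATA=' /proc/config.gz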

-- 
NULL && (void)


