Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag

Mateusz Jończyk <mat.jonczyk@xxxxx> · Sun, 7 Jul 2024 21:50:16 +0200

On 6.07.2024 at 16:30, Mateusz Jończyk wrote:
> Hello,
>
> Linux 6.9+ cannot start a degraded RAID1 array when the only remaining
> device has the write-mostly flag set. Linux 6.8.0 works fine, as does
> 6.1.96.
[snip]
> After some investigation, I have determined that the bug is most likely in
> choose_slow_rdev() in drivers/md/raid1.c, which doesn't set max_sectors
> before returning early. A test patch (below) seems to fix this issue (Linux
> boots and appears to be working correctly with it, but I didn't do any more
> advanced experiments yet).
>
> This points to
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> as the most likely culprit. However, I was running into other bugs in mdadm when
> trying to test this commit directly.
>
> Distribution: Ubuntu 20.04, hardware: a HP 17-by0001nw laptop.

I have been testing this patch carefully:

1. I reliably get deadlocks when adding or removing devices
on an array that contains a component with the write-mostly flag set,
but only while the array is under load from fsstress. When the array was
idle, no such deadlocks happened. The deadlocks also occur on Linux 6.8.0,
but not on 6.1.97-rc1, so this is likely an independent regression.

2. When adding a device to the array (/dev/sda1), I once got the following
warnings in dmesg on a patched 6.10-rc6:

        [ 8253.337816] md: could not open device unknown-block(8,1).
        [ 8253.337832] md: md_import_device returned -16
        [ 8253.338152] md: could not open device unknown-block(8,1).
        [ 8253.338169] md: md_import_device returned -16
        [ 8253.674751] md: recovery of RAID array md2

(/dev/sda1 has device major/minor numbers 8,1.) This may be caused by some
interaction with udev. I have also seen these warnings on Linux 6.8.
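
For anyone decoding the log above: the -16 returned by md_import_device is a
negated errno value, which on Linux is EBUSY, and unknown-block(8,1) is a
major/minor pair. A minimal Python sketch of that decoding follows. The helper
names are made up for illustration and are not part of mdadm or the kernel,
and Python's errno table reflects the host platform, so this assumes it is
run on Linux.

```python
import errno
import os


def errno_name(ret):
    """Map a negative kernel return value (e.g. -16) to its errno symbol."""
    return errno.errorcode.get(-ret, "unknown")


def dev_numbers(path):
    """Return (major, minor) of the device node at `path`, e.g. /dev/sda1."""
    rdev = os.stat(path).st_rdev
    return os.major(rdev), os.minor(rdev)


if __name__ == "__main__":
    # -16 is the value from the dmesg lines above (hypothetical usage).
    print(errno_name(-16), os.strerror(16))
```

On the machine in question, dev_numbers("/dev/sda1") would be expected to
return (8, 1), matching the unknown-block(8,1) in the warnings.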

Additionally, on an unpatched 6.1.97-rc1 (which was handy for testing), I got a deadlock
when removing a bitmap from such an array while it was loaded with fsstress.

I'll file separate reports for these issues, but wanted to give a heads-up.

Greetings,

Mateusz