Re: [REGRESSION] Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)

Mateusz Jończyk <mat.jonczyk@xxxxx> · Tue, 30 Jul 2024 22:35:49 +0200

On 28.07.2024 at 12:30, Mateusz Jończyk wrote:
> On 25.07.2024 at 16:27, Paul E Luse wrote:
>> On Thu, 25 Jul 2024 09:15:40 +0200
>> Mateusz Jończyk <mat.jonczyk@xxxxx> wrote:
>>
>>> On 24 July 2024 23:19:06 CEST, Paul E Luse
>>> <paul.e.luse@xxxxxxxxxxxxxxx> wrote:
>>>> On Wed, 24 Jul 2024 22:35:49 +0200
>>>> Mateusz Jończyk <mat.jonczyk@xxxxx> wrote:
>>>>
>>>>> On 22.07.2024 at 07:39, Mateusz Jończyk wrote:
>>>>>> On 20.07.2024 at 16:47, Mateusz Jończyk wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> In my laptop, I used to have two RAID1 arrays on top of NVMe and
>>>>>>> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
>>>>>>> for remaining data (LUKS + LVM + ext4).
>>>>>>> For performance, I have marked the RAID component
>>>>>>> device for /dev/md1 on the SATA SSD drive write-mostly, which
>>>>>>> "means that the 'md' driver will avoid reading from these
>>>>>>> devices if at all possible" (man mdadm).
>>>>>>>
>>>>>>> Recently, the NVMe drive started having problems (PCI AER errors
>>>>>>> and the controller disappearing), so I removed it from the
>>>>>>> arrays and wiped it. However, I have reseated the drive in the
>>>>>>> M.2 socket and this apparently fixed it (verified with tests).
>>>>>>>
>>>>>>>     $ cat /proc/mdstat
>>>>>>>     Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
>>>>>>>     md1 : active raid1 sdb5[1](W)
>>>>>>>           471727104 blocks super 1.2 [2/1] [_U]
>>>>>>>           bitmap: 4/4 pages [16KB], 65536KB chunk
>>>>>>>
>>>>>>>     md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
>>>>>>>           3142656 blocks super 1.2 [2/2] [UU]
>>>>>>>           bitmap: 0/1 pages [0KB], 65536KB chunk
>>>>>>>
>>>>>>>     md0 : active raid1 sdb4[3]
>>>>>>>           2094080 blocks super 1.2 [2/1] [_U]
>>>>>>>          
>>>>>>>     unused devices: <none>
>>>>>>>
>>>>>>> (md2 was used just for testing, ignore it).
>>>>>>>
>>>>>>> Today, I have tried to add the drive back to the arrays by
>>>>>>> using a script that executed in quick succession:
>>>>>>>
>>>>>>>     mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
>>>>>>>     mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
>>>>>>>
>>>>>>> This was on Linux 6.10.0, patched with my previous patch:
>>>>>>>
>>>>>>>     https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@xxxxx/
>>>>>>>
>>>>>>> (which fixed a regression in the kernel and allows it to start
>>>>>>> /dev/md1 with a single drive in write-mostly mode).
>>>>>>> In the background, I was running "rdiff-backup --compare" that
>>>>>>> was comparing data between my array contents and a backup
>>>>>>> attached via USB.
>>>>>>>
>>>>>>> This, however, resulted in mayhem: I was unable to start any
>>>>>>> program because of input/output errors, etc. I used SysRQ + C to
>>>>>>> save a kernel log:
>>>>>>>
>>>>> Hello,
>>>>>
>>>>> Unfortunately, hardware failure seems not to be the case.
>>>>>
>>>>> I did test it again on 6.10, twice, and in both cases I got
>>>>> filesystem corruption (but not as severe).
>>>>>
>>>>> On Linux 6.1.96 it seems to be working well (also did two tries).
>>>>>
>>>>> Please note: in my tests, I was using a RAID component device with
>>>>> a write-mostly bit set. This setup does not work on 6.9+ out of the
>>>>> box and requires the following patch:
>>>>>
>>>>> commit 36a5c03f23271 ("md/raid1: set max_sectors during early
>>>>> return from choose_slow_rdev()")
>>>>>
>>>>> that is in master now.
>>>>>
>>>>> It is also heading into stable, which I'm going to interrupt.
> Hello,
>
> With much effort (it is challenging to reproduce reliably) I think I have nailed down the issue to the read_balance refactoring series in 6.9:
[snip]
> After code analysis, I have noticed that the following check, which was present in the old
> read_balance(), is not present in an equivalent form in the new code:
>
>                 if (!test_bit(In_sync, &rdev->flags) &&
>                     rdev->recovery_offset < this_sector + sectors)
>                         continue;
>
> (in choose_slow_rdev() and choose_first_rdev() and possibly other functions)
>
> which would cause the kernel to read from the device being synced to before
> it is ready.

Hello,

I think I have made a reliable (and safe) reproducer for this bug:

Prerequisite: create an array on top of 2 devices 1GB+ large:

    mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/nvme0n1p5 --write-mostly /dev/sdb8

The script:
-------------------------------8<------------------------

#!/bin/bash

# degrade the array: fail and remove the member that is not write-mostly
mdadm /dev/md4 --fail /dev/nvme0n1p5
sleep 1
mdadm /dev/md4 --remove failed
sleep 1

# fill with random data
shred -n1 -v /dev/md4
# fill with zeros
shred -n0 -zv /dev/nvme0n1p5

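# reference checksum, taken while the array is still degraded
sha256sum /dev/md4

# drop the page cache so that the next read goes to the devices
echo 1 > /proc/sys/vm/drop_caches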

date

# calculate a shasum while the array is being synced
( sha256sum /dev/md4; date ) &
mdadm /dev/md4 --add --readwrite /dev/nvme0n1p5
date

-------------------------------8<------------------------

The two shasums should be equal, but they were different in my tests on affected kernels.
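
To avoid comparing the two sums by eye, the check could be automated along these lines. This is only a sketch: the helper names digest_of and compare_digests are made up for illustration and are not part of the script above.

```shell
#!/bin/bash

# Hypothetical helper: print only the SHA-256 digest of a file or block device.
digest_of() {
    sha256sum "$1" | awk '{print $1}'
}

# Hypothetical helper: compare two digests.
# Prints MATCH and returns 0 if they are equal,
# prints MISMATCH and returns 1 otherwise.
compare_digests() {
    if [ "$1" = "$2" ]; then
        echo "MATCH"
    else
        echo "MISMATCH: $1 != $2"
        return 1
    fi
}
```

In the reproducer, the first sum would be stored with before=$(digest_of /dev/md4) while the array is degraded, the second with during=$(digest_of /dev/md4) while the resync is running, and compare_digests "$before" "$during" would then report a mismatch on an affected kernel.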

Also, in my tests with the script, *without* a write-mostly device in the array, the problems did not happen.
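
When switching between the two configurations, it helps to confirm which one a given run actually exercised. /proc/mdstat marks write-mostly members with "(W)", as in the output quoted earlier in this thread. A small sketch that extracts them follows; the function name write_mostly_members is hypothetical, and it assumes the member list sits on the "mdX : ..." line as in that output.

```shell
#!/bin/bash

# Hypothetical helper: read mdstat-formatted text on stdin and print the
# members of array $1 that carry the write-mostly "(W)" flag, one per line.
write_mostly_members() {
    awk -v md="$1" '
        $1 == md && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(W\)/) {
                    name = $i
                    sub(/\[.*/, "", name)   # strip "[N](W)" from e.g. sdb5[1](W)
                    print name
                }
            }
        }'
}
```

Usage: write_mostly_members md4 < /proc/mdstat. Empty output means the array has no write-mostly member at that moment.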

Greetings,

Mateusz