Re: Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)

Mateusz Jończyk <mat.jonczyk@xxxxx> · Thu, 25 Jul 2024 09:15:40 +0200

On 24 July 2024 23:19:06 CEST, Paul E Luse <paul.e.luse@xxxxxxxxxxxxxxx> wrote:
>On Wed, 24 Jul 2024 22:35:49 +0200
>Mateusz Jończyk <mat.jonczyk@xxxxx> wrote:
>
>> On 22.07.2024 at 07:39, Mateusz Jończyk wrote:
>> > On 20.07.2024 at 16:47, Mateusz Jończyk wrote:
>> >> Hello,
>> >>
>> >> In my laptop, I used to have two RAID1 arrays on top of NVMe and
>> >> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
>> >> for remaining data (LUKS + LVM + ext4). For performance, I have
>> >> marked the RAID component device for /dev/md1 on the SATA SSD
>> >> drive write-mostly, which "means that the 'md' driver will avoid
>> >> reading from these devices if at all possible" (man mdadm).
>> >>
>> >> Recently, the NVMe drive started having problems (PCI AER errors
>> >> and the controller disappearing), so I removed it from the arrays
>> >> and wiped it. However, I have reseated the drive in the M.2 socket
>> >> and this apparently fixed it (verified with tests).
>> >>
>> >>     $ cat /proc/mdstat
>> >>     Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
>> >>     md1 : active raid1 sdb5[1](W)
>> >>           471727104 blocks super 1.2 [2/1] [_U]
>> >>           bitmap: 4/4 pages [16KB], 65536KB chunk
>> >>
>> >>     md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
>> >>           3142656 blocks super 1.2 [2/2] [UU]
>> >>           bitmap: 0/1 pages [0KB], 65536KB chunk
>> >>
>> >>     md0 : active raid1 sdb4[3]
>> >>           2094080 blocks super 1.2 [2/1] [_U]
>> >>
>> >>     unused devices: <none>
>> >>
>> >> (md2 was used just for testing, ignore it).
>> >>
>> >> Today, I have tried to add the drive back to the arrays by using a
>> >> script that executed in quick succession:
>> >>
>> >>     mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
>> >>     mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
>> >>
>> >> This was on Linux 6.10.0, patched with my previous patch:
>> >>
>> >>     https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@xxxxx/
>> >>
>> >> (which fixed a regression in the kernel and allows it to start
>> >> /dev/md1 with a single drive in write-mostly mode).
>> >> In the background, I was running "rdiff-backup --compare" that was
>> >> comparing data between my array contents and a backup attached via
>> >> USB.
>> >>
>> >> This, however, resulted in mayhem: I was unable to start any
>> >> program because of input/output errors, etc. I used SysRq + C to
>> >> save a kernel log:
>> >>
>> > Hello,
>> >
>> > It is possible that my second SSD has some problems and high read
>> > activity during RAID resync triggered it. Reads from that drive are
>> > now very slow (between 10 - 30 MB/s) and this suggests that
>> > something is not OK.
>> 
>> Hello,
>> 
>> Unfortunately, hardware failure seems not to be the case.
>> 
>> I did test it again on 6.10, twice, and in both cases I got
>> filesystem corruption (but not as severe).
>> 
>> On Linux 6.1.96 it seems to be working well (also did two tries).
>> 
>> Please note: in my tests, I was using a RAID component device with
>> a write-mostly bit set. This setup does not work on 6.9+ out of the
>> box and requires the following patch:
>> 
>> commit 36a5c03f23271 ("md/raid1: set max_sectors during early return
>> from choose_slow_rdev()")
>> 
>> that is in master now.
>> 
>> It is also heading into stable, which I'm going to interrupt.
>
>Hi Mateusz,
>
>I'm pretty interested in what is happening here especially as it
>relates to write-mostly.  Couple of questions for you:
>
>1) Are you able to find a simpler reproduction for this, for example
>without mixing SATA and NVMe.  Maybe just using two known good NVMe
>SSDs and follow your steps to repro?

Hello,

Well, I have three drives in my laptop: an NVMe SSD, a SATA SSD (in the DVD bay) and a SATA HDD (platter). I could do tests on top of the two SATA drives.
But maybe it would be easier for me to bisect (or guess-bisect) in the current setup; I haven't made up my mind yet.

>
>2) I don't fully understand your last two statements, maybe you can
>clarify?  With your max_sectors patch does it pass or fail?  If pass,
>what do you mean by "I'm going to interrupt"? It sounds like you mean the
>patch doesn't work and you are trying to stop it??

Without this patch I wouldn't be able to do the tests: without it, a degraded RAID1 array with a single drive in write-mostly mode doesn’t start at all.

With my last statement I meant that I was going to stop this patch from going to stable kernels. At this point, it doesn’t seem to me that my patch
is the direct cause of the problems or that I missed something in it. However, I think that it is currently better to fail this setup outright rather
than risk somebody's data.

I have made further tests:

- vanilla 6.8.0 with a write-mostly drive works correctly,

- vanilla 6.10-rc6 without the write-mostly bit set also works correctly.

So it seems that the problem happens only with the write-mostly mode and after 6.8.0.
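
In case it helps anyone repeating these tests, here is a minimal sketch of a helper for recording, before each run, which arrays are degraded and which members carry the write-mostly flag. The function name md_summary and its one-line-per-array output format are made up for illustration. It only parses text in the /proc/mdstat layout quoted above and assumes arrays are named mdN, so treat it as a starting point rather than a complete parser.

```shell
#!/bin/sh
# Hypothetical helper: summarize mdstat-formatted text read from stdin.
# Prints one line per array: "<name> <all-up|degraded> write-mostly=<devs|none>"
md_summary() {
    awk '
        # Array header line, e.g. "md1 : active raid1 sdb5[1](W)"
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            name = $1
            wm = ""
            for (i = 3; i <= NF; i++) {
                # Members look like "sdb5[1]"; "(W)" marks write-mostly
                if ($i ~ /\[[0-9]+\]/ && $i ~ /\(W\)/) {
                    dev = $i
                    sub(/\[.*/, "", dev)
                    wm = (wm == "" ? dev : wm "," dev)
                }
            }
            next
        }
        # Status line, e.g. "471727104 blocks super 1.2 [2/1] [_U]"
        name != "" && /blocks/ {
            state = "all-up"
            # An underscore in the last field marks a missing member
            if ($NF ~ /_/) state = "degraded"
            printf "%s %s write-mostly=%s\n", name, state, (wm == "" ? "none" : wm)
            name = ""
        }
    '
}
```

Usage would be `md_summary < /proc/mdstat`, run before and after each `mdadm --add`, so that the state of the array at the start of every test ends up in the log next to the kernel version.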

Greetings,

Mateusz