Re: Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)

Paul E Luse <paul.e.luse@xxxxxxxxxxxxxxx> · Thu, 25 Jul 2024 07:27:42 -0700

On Thu, 25 Jul 2024 09:15:40 +0200
Mateusz Jończyk <mat.jonczyk@xxxxx> wrote:

> Dnia 24 lipca 2024 23:19:06 CEST, Paul E Luse
> <paul.e.luse@xxxxxxxxxxxxxxx> napisał/a:
> >On Wed, 24 Jul 2024 22:35:49 +0200
> >Mateusz Jończyk <mat.jonczyk@xxxxx> wrote:
> >
> >> W dniu 22.07.2024 o 07:39, Mateusz Jończyk pisze:
> >> > W dniu 20.07.2024 o 16:47, Mateusz Jończyk pisze:
> >> >> Hello,
> >> >>
> >> >> In my laptop, I used to have two RAID1 arrays on top of NVMe and
> >> >> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
> >> >> for remaining data (LUKS
> >> >> + LVM + ext4). For performance, I have marked the RAID component
> >> >> device for /dev/md1 on the SATA SSD drive write-mostly, which
> >> >> "means that the 'md' driver will avoid reading from these
> >> >> devices if at all possible" (man mdadm).
> >> >>
> >> >> Recently, the NVMe drive started having problems (PCI AER errors
> >> >> and the controller disappearing), so I removed it from the
> >> >> arrays and wiped it. However, I have reseated the drive in the
> >> >> M.2 socket and this apparently fixed it (verified with tests).
> >> >>
> >> >>     $ cat /proc/mdstat
> >> >>     Personalities : [raid1] [linear] [multipath] [raid0] [raid6]
> >> >> [raid5] [raid4] [raid10] md1 : active raid1 sdb5[1](W)
> >> >>           471727104 blocks super 1.2 [2/1] [_U]
> >> >>           bitmap: 4/4 pages [16KB], 65536KB chunk
> >> >>
> >> >>     md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
> >> >>           3142656 blocks super 1.2 [2/2] [UU]
> >> >>           bitmap: 0/1 pages [0KB], 65536KB chunk
> >> >>
> >> >>     md0 : active raid1 sdb4[3]
> >> >>           2094080 blocks super 1.2 [2/1] [_U]
> >> >>          
> >> >>     unused devices: <none>
> >> >>
> >> >> (md2 was used just for testing, ignore it).
> >> >>
> >> >> Today, I have tried to add the drive back to the arrays by
> >> >> using a script that executed in quick succession:
> >> >>
> >> >>     mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
> >> >>     mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
> >> >>
> >> >> This was on Linux 6.10.0, patched with my previous patch:
> >> >>
> >> >>     https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@xxxxx/
> >> >>
> >> >> (which fixed a regression in the kernel and allows it to start
> >> >> /dev/md1 with a single drive in write-mostly mode).
> >> >> In the background, I was running "rdiff-backup --compare" that
> >> >> was comparing data between my array contents and a backup
> >> >> attached via USB.
> >> >>
> >> >> This, however resulted in mayhem - I was unable to start any
> >> >> program with an input-output error, etc. I used SysRQ + C to
> >> >> save a kernel log:
> >> >>
> >> > Hello,
> >> >
> >> > It is possible that my second SSD has some problems and high read
> >> > activity during RAID resync triggered it. Reads from that drive
> >> > are now very slow (between 10 - 30 MB/s) and this suggests that
> >> > something is not OK.
> >> 
> >> Hello,
> >> 
> >> Unfortunately, hardware failure seems not to be the case.
> >> 
> >> I did test it again on 6.10, twice, and in both cases I got
> >> filesystem corruption (but not as severe).
> >> 
> >> On Linux 6.1.96 it seems to be working well (also did two tries).
> >> 
> >> Please note: in my tests, I was using a RAID component device with
> >> a write-mostly bit set. This setup does not work on 6.9+ out of the
> >> box and requires the following patch:
> >> 
> >> commit 36a5c03f23271 ("md/raid1: set max_sectors during early
> >> return from choose_slow_rdev()")
> >> 
> >> that is in master now.
> >> 
> >> It is also heading into stable, which I'm going to interrupt.
> >
> >Hi Mateusz,
> >
> >I'm pretty interested in what is happening here especially as it
> >relates to write-mostly.  Couple of questions for you:
> >
> >1) Are you able to find a simpler reproduction for this, for example
> >without mixing SATA and NVMe.  Maybe just using two known good NVMe
> >SSDs and follow your steps to repro?
> 
> Hello,
> 
> Well, I have three drives in my laptop: NVMe, SATA SSD (in the DVD
> bay) and SATA HDD (platter). I could do tests on top of these two
> SATA drives. But maybe it would be easier for me to bisect (or
> guess-bisect) in the current setup, I haven't made up my mind yet.
> 

OK, thanks.

> >
> >2) I don't fully understand your last two statements, maybe you can
> >clarify?  With your max_sectors patch does it pass or fail?  If pass,
> >what do mean by "I'm going to interrupt"? It sounds like you mean the
> >patch doesn't work and you are trying to stop it??
> 
> Without this patch I wouldn't be able to do the tests. Without it,
> degraded RAID1 with a single drive in write-mostly mode doesn’t start
> at all.
> 
> With my last statement I meant that I was going to stop this patch
> from going to stable kernels. At this point, it doesn’t seem to me
> that my patch is the direct cause of the problems, that I missed
> something. However, I think that it is currently better to fail this
> setup outright rather than risk somebody's data.
> 

OK, I would say please do not try to stop the patch, it is a good fix
although maybe not completely solving your problem it should land.

Unless Kwai has another opinion.

-Paul

> I have made further tests:
> 
> - vanilla 6.8.0 with a write-mostly drive works correctly,
> 
> - vanilla 6.10-rc6 without the write mostly bit set also works
> correctly. 
> 
> So it seems that the problem happens only with the write-mostly mode
> and after 6.8.0.
> 
> Greetings,
> 
> Mateusz
> 
>