On 25.07.2024 at 16:27, Paul E Luse wrote:
> On Thu, 25 Jul 2024 09:15:40 +0200
> Mateusz Jończyk <mat.jonczyk@xxxxx> wrote:
>
>> On 24 July 2024 23:19:06 CEST, Paul E Luse
>> <paul.e.luse@xxxxxxxxxxxxxxx> wrote:
>>> On Wed, 24 Jul 2024 22:35:49 +0200
>>> Mateusz Jończyk <mat.jonczyk@xxxxx> wrote:
>>>
>>>> On 22.07.2024 at 07:39, Mateusz Jończyk wrote:
>>>>> On 20.07.2024 at 16:47, Mateusz Jończyk wrote:
>>>>>> Hello,
>>>>>>
>>>>>> In my laptop, I used to have two RAID1 arrays on top of NVMe and
>>>>>> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
>>>>>> for remaining data (LUKS + LVM + ext4). For performance, I have
>>>>>> marked the RAID component device for /dev/md1 on the SATA SSD
>>>>>> drive write-mostly, which "means that the 'md' driver will avoid
>>>>>> reading from these devices if at all possible" (man mdadm).
>>>>>>
>>>>>> Recently, the NVMe drive started having problems (PCI AER errors
>>>>>> and the controller disappearing), so I removed it from the
>>>>>> arrays and wiped it. However, I have reseated the drive in the
>>>>>> M.2 socket and this apparently fixed it (verified with tests).
>>>>>>
>>>>>> $ cat /proc/mdstat
>>>>>> Personalities : [raid1] [linear] [multipath] [raid0] [raid6]
>>>>>> [raid5] [raid4] [raid10]
>>>>>> md1 : active raid1 sdb5[1](W)
>>>>>>       471727104 blocks super 1.2 [2/1] [_U]
>>>>>>       bitmap: 4/4 pages [16KB], 65536KB chunk
>>>>>>
>>>>>> md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
>>>>>>       3142656 blocks super 1.2 [2/2] [UU]
>>>>>>       bitmap: 0/1 pages [0KB], 65536KB chunk
>>>>>>
>>>>>> md0 : active raid1 sdb4[3]
>>>>>>       2094080 blocks super 1.2 [2/1] [_U]
>>>>>>
>>>>>> unused devices: <none>
>>>>>>
>>>>>> (md2 was used just for testing, ignore it).
>>>>>>
>>>>>> Today, I have tried to add the drive back to the arrays by
>>>>>> using a script that executed in quick succession:
>>>>>>
>>>>>> mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
>>>>>> mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
>>>>>>
>>>>>> This was on Linux 6.10.0, patched with my previous patch:
>>>>>>
>>>>>> https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@xxxxx/
>>>>>>
>>>>>> (which fixed a regression in the kernel and allows it to start
>>>>>> /dev/md1 with a single drive in write-mostly mode).
>>>>>> In the background, I was running "rdiff-backup --compare" that
>>>>>> was comparing data between my array contents and a backup
>>>>>> attached via USB.
>>>>>>
>>>>>> This, however, resulted in mayhem - I was unable to start any
>>>>>> program with an input-output error, etc. I used SysRq + C to
>>>>>> save a kernel log:
>>>>>>
>>>> Hello,
>>>>
>>>> Unfortunately, hardware failure seems not to be the case.
>>>>
>>>> I did test it again on 6.10, twice, and in both cases I got
>>>> filesystem corruption (but not as severe).
>>>>
>>>> On Linux 6.1.96 it seems to be working well (also did two tries).
>>>>
>>>> Please note: in my tests, I was using a RAID component device with
>>>> a write-mostly bit set. This setup does not work on 6.9+ out of the
>>>> box and requires the following patch:
>>>>
>>>> commit 36a5c03f23271 ("md/raid1: set max_sectors during early
>>>> return from choose_slow_rdev()")
>>>>
>>>> that is in master now.
>>>>
>>>> It is also heading into stable, which I'm going to interrupt.
Hello,

With much effort (this is challenging to reproduce reliably), I think I have
nailed down the issue to the read_balance() refactoring series in 6.9:

86b1e613eb3b Merge tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
e81faa91a580 Merge branch 'raid1-read_balance' into md-6.9
0091c5a269ec md/raid1: factor out helpers to choose the best rdev from read_balance()
ba58f57fdf98 md/raid1: factor out the code to manage sequential IO
9f3ced792203 md/raid1: factor out choose_bb_rdev() from read_balance()
dfa8ecd167c1 md/raid1: factor out choose_slow_rdev() from read_balance()
31a73331752d md/raid1: factor out read_first_rdev() from read_balance()
f10920762955 md/raid1-10: factor out a new helper raid1_should_read_first()
f29841ff3b27 md/raid1-10: add a helper raid1_check_read_range()
257ac239ffcf md/raid1: fix choose next idle in read_balance()
2c27d09d3a76 md/raid1: record nonrot rdevs while adding/removing rdevs to conf
969d6589abcb md/raid1: factor out helpers to add rdev to conf
3a0f007b6979 md: add a new helper rdev_has_badblock()

In particular, 86b1e613eb3b is definitely bad, and 13fe8e6825e4 is 95% good.

I was testing with the following two commits on top of the series to make
this setup work for me:

commit 36a5c03f23271 ("md/raid1: set max_sectors during early return from choose_slow_rdev()")
commit b561ea56a264 ("block: allow device to have both virt_boundary_mask and max segment size")

After code analysis, I have noticed that the following check, which was
present in the old read_balance(), is not present (in equivalent form) in
the new code:

        if (!test_bit(In_sync, &rdev->flags) &&
            rdev->recovery_offset < this_sector + sectors)
                continue;

(in choose_slow_rdev() and choose_first_rdev(), and possibly other
functions). Its absence would cause the kernel to read from a device that
is still being synced to, before the requested range is ready.

In my debug patch (which I'll send in a while), I have copied this check
into raid1_check_read_range(), and with it the problems seem to no longer
happen.

I'm not so sure now that this bug is limited to write-mostly devices,
though - my previous tests may have been unreliable.

#regzbot introduced: 13fe8e6825e4..86b1e613eb3b
#regzbot monitor: https://lore.kernel.org/lkml/20240724141906.10b4fc4e@peluse-desk5/T/#m671d6d3a7eda44d39d0882864a98824f52c52917

Greetings,

Mateusz
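PS: To make the failure mode concrete, below is a minimal self-contained
userspace sketch of the condition quoted above. It is not kernel code and
not my debug patch; struct model_rdev, range_is_readable() and the numbers
are made up for illustration - they only mirror the idea behind
test_bit(In_sync, &rdev->flags) and rdev->recovery_offset.

/* Simplified userspace model of the missing check - not kernel code. */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned long long sector_t;

struct model_rdev {
        const char *name;
        bool in_sync;             /* stands in for test_bit(In_sync, ...) */
        sector_t recovery_offset; /* sectors already resynced on this device */
};

/* The range [this_sector, this_sector + sectors) may only be read from a
 * device that is either fully in sync or already recovered past the end of
 * the range - the condition the old read_balance() enforced with "continue". */
static bool range_is_readable(const struct model_rdev *rdev,
                              sector_t this_sector, int sectors)
{
        return rdev->in_sync ||
               rdev->recovery_offset >= this_sector + sectors;
}

int main(void)
{
        struct model_rdev rebuilding = {
                .name = "nvme0n1p3 (being re-added)",
                .in_sync = false,
                .recovery_offset = 1ULL << 20, /* ~512 MiB resynced so far */
        };
        struct model_rdev in_sync = {
                .name = "sdb5 (in sync)",
                .in_sync = true,
        };
        const struct model_rdev *candidates[] = { &rebuilding, &in_sync };

        sector_t this_sector = 5ULL << 20; /* read well past recovery_offset */
        int sectors = 8;

        for (int i = 0; i < 2; i++) {
                if (!range_is_readable(candidates[i], this_sector, sectors)) {
                        /* Without this check the read could be served from
                         * the rebuilding device and return stale data. */
                        printf("skip %s: not recovered up to sector %llu\n",
                               candidates[i]->name, this_sector + sectors);
                        continue;
                }
                printf("read from %s\n", candidates[i]->name);
                break;
        }
        return 0;
}

Compiled with gcc and run, it skips the rebuilding device and serves the
read from the in-sync one, which is the behaviour the old check guaranteed.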