Re: [REGRESSION] Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)

Hi,

On 2024/07/31 4:35, Mateusz Jończyk wrote:
On 28.07.2024 at 12:30, Mateusz Jończyk wrote:
On 25.07.2024 at 16:27, Paul E Luse wrote:
On Thu, 25 Jul 2024 09:15:40 +0200
Mateusz Jończyk <mat.jonczyk@xxxxx> wrote:

On 24 July 2024 23:19:06 CEST, Paul E Luse
<paul.e.luse@xxxxxxxxxxxxxxx> wrote:
On Wed, 24 Jul 2024 22:35:49 +0200
Mateusz Jończyk <mat.jonczyk@xxxxx> wrote:

On 22.07.2024 at 07:39, Mateusz Jończyk wrote:
On 20.07.2024 at 16:47, Mateusz Jończyk wrote:
Hello,

In my laptop, I used to have two RAID1 arrays on top of NVMe and
SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1 for
the remaining data (LUKS + LVM + ext4). For performance, I have marked
the RAID component device of /dev/md1 on the SATA SSD drive as
write-mostly, which "means that the 'md' driver will avoid reading
from these devices if at all possible" (man mdadm).
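
For reference, the write-mostly flag can be set when the component is
added, or toggled later through the per-device sysfs state file (as
described in Documentation/admin-guide/md.rst); a minimal sketch - the
md1/sdb5 names mirror the setup above and are assumptions, adjust to
the actual component name:

    # set the flag while adding the component (mdadm's -W/--write-mostly
    # applies to the devices listed after it)
    mdadm /dev/md1 --add --write-mostly /dev/sdb5

    # or toggle it on an existing member via sysfs
    echo writemostly  > /sys/block/md1/md/dev-sdb5/state
    echo -writemostly > /sys/block/md1/md/dev-sdb5/state    # clears it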

Recently, the NVMe drive started having problems (PCI AER errors
and the controller disappearing), so I removed it from the arrays
and wiped it. However, reseating the drive in the M.2 socket
apparently fixed it (verified with tests).

     $ cat /proc/mdstat
     Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
     md1 : active raid1 sdb5[1](W)
           471727104 blocks super 1.2 [2/1] [_U]
           bitmap: 4/4 pages [16KB], 65536KB chunk

     md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
           3142656 blocks super 1.2 [2/2] [UU]
           bitmap: 0/1 pages [0KB], 65536KB chunk

     md0 : active raid1 sdb4[3]
           2094080 blocks super 1.2 [2/1] [_U]

     unused devices: <none>

(md2 was used just for testing, ignore it).

Today, I tried to add the drive back to the arrays using a script
that executed the following commands in quick succession:

     mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
     mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3

This was on Linux 6.10.0, patched with my previous patch:

     https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@xxxxx/

(which fixed a regression in the kernel and allows it to start
/dev/md1 with a single drive in write-mostly mode).
In the background, I was running "rdiff-backup --compare", which was
comparing my array contents against a backup attached via USB.

This, however, resulted in mayhem - I was unable to start any
program (every attempt failed with an input/output error), etc.
I used SysRq + C to save a kernel log:

Hello,

Unfortunately, a hardware failure does not seem to be the cause.

I did test it again on 6.10, twice, and in both cases I got
filesystem corruption (but not as severe).

On Linux 6.1.96 it seems to work well (I also did two tries).

Please note: in my tests, I was using a RAID component device with
a write-mostly bit set. This setup does not work on 6.9+ out of the
box and requires the following patch:

commit 36a5c03f23271 ("md/raid1: set max_sectors during early
return from choose_slow_rdev()")

that is in master now.

It is also heading into stable, which I'm going to interrupt.
Hello,

With much effort (it was challenging to reproduce reliably), I think I have nailed down the issue to the read_balance() refactoring series in 6.9:
[snip]
After code analysis, I have noticed that the following check, which was
present in the old read_balance(), has no equivalent in the new code:

                 if (!test_bit(In_sync, &rdev->flags) &&
                     rdev->recovery_offset < this_sector + sectors)
                         continue;

(the check is missing in choose_slow_rdev(), choose_first_rdev(), and possibly other functions)

which would cause the kernel to read from the device being synced to before
it is ready.
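
From userspace, the window in which such a premature read can happen is
visible as the ongoing recovery of the re-added device; a small
observation sketch, assuming the md4/nvme0n1p5 names from the reproducer
below and that the per-device sysfs file recovery_start mirrors
rdev->recovery_offset:

    # print overall recovery progress and, if available, how many sectors
    # of the re-added leg are already known to be valid; reads beyond that
    # offset must not be served from it
    while grep -Eq 'recovery|resync' /proc/mdstat; do
        grep -A2 '^md4' /proc/mdstat
        cat /sys/block/md4/md/dev-nvme0n1p5/recovery_start 2>/dev/null
        sleep 1
    done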

Hello,

I think I have made a reliable (and safe) reproducer for this bug:

Prerequisite: create an array on top of two devices that are at least 1 GB in size:

mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/nvme0n1p5 --write-mostly /dev/sdb8
The script:
-------------------------------8<------------------------

#!/bin/bash

mdadm /dev/md4 --fail /dev/nvme0n1p5
sleep 1
mdadm /dev/md4 --remove failed
sleep 1

# fill with random data
shred -n1 -v /dev/md4
# fill with zeros
shred -n0 -zv /dev/nvme0n1p5

sha256sum /dev/md4

echo 1 > /proc/sys/vm/drop_caches

date

# calculate a shasum while the array is being synced
( sha256sum /dev/md4; date ) &
mdadm /dev/md4 --add --readwrite /dev/nvme0n1p5
date

-------------------------------8<------------------------

The two shasums should be equal, but they were different in my tests on affected kernels.

Also, in my tests with the script, the problems did not happen *without* a write-mostly device in the array.
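
As an extra sanity check (not part of the reproducer above): if the
mismatch comes only from reads served from the not-yet-recovered device,
a checksum taken after the recovery has finished should match the first
one again; a minimal sketch using the same /dev/md4:

    # wait for the recovery started by --add to finish, then re-check
    mdadm --wait /dev/md4
    echo 1 > /proc/sys/vm/drop_caches
    sha256sum /dev/md4        # expected to match the pre-add checksum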

Thanks for the test,

Can you send a new version of the patch, and submit this test to mdadm?
Kuai


Greetings,

Mateusz
