On Thu, Feb 01, 2024 at 05:25:45PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@xxxxxxxxxx>
>
> I applied this patchset on top of v6.8-rc1 and ran the lvm2 test suite
> with the following command for 24 rounds (about 2 days):
>
> for t in `ls test/shell`; do
>         if grep -q raid "test/shell/$t"; then
>                 make check T=shell/$t
>         fi
> done
>
> failed count  failed test
>  1  ### failed: [ndev-vanilla] shell/dmsecuretest.sh
>  1  ### failed: [ndev-vanilla] shell/dmsetup-integrity-keys.sh
>  1  ### failed: [ndev-vanilla] shell/dmsetup-keyring.sh
>  5  ### failed: [ndev-vanilla] shell/duplicate-pvs-md0.sh
>  1  ### failed: [ndev-vanilla] shell/duplicate-vgid.sh
>  2  ### failed: [ndev-vanilla] shell/duplicate-vgnames.sh
>  1  ### failed: [ndev-vanilla] shell/fsadm-crypt.sh
>  1  ### failed: [ndev-vanilla] shell/integrity.sh
>  6  ### failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
>  2  ### failed: [ndev-vanilla] shell/lvchange-rebuild-raid.sh
>  5  ### failed: [ndev-vanilla] shell/lvconvert-raid-reshape-stripes-load-reload.sh
>  4  ### failed: [ndev-vanilla] shell/lvconvert-raid-restripe-linear.sh
>  1  ### failed: [ndev-vanilla] shell/lvconvert-raid1-split-trackchanges.sh
> 20  ### failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
> 20  ### failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
> 24  ### failed: [ndev-vanilla] shell/lvextend-raid.sh
>
> And I randomly picked some tests and verified by hand that they fail
> in v6.6 as well (not all tests were checked):
>
> shell/lvextend-raid.sh
> shell/lvcreate-large-raid.sh
> shell/lvconvert-repair-raid.sh
> shell/lvchange-rebuild-raid.sh
> shell/lvchange-raid1-writemostly.sh

In my testing with this patchset on top of the head of Linus's tree
(5c24e4e9e708), I am seeing failures in
shell/lvconvert-raid-reshape-stripes-load-reload.sh and
shell/lvconvert-repair-raid.sh in about 20% of my runs. I have never
seen either of these fail running on the 6.6 kernel (ffc253263a13).
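
To reproduce, rerunning a single test in a loop along these lines is
enough (the round count of 20 is arbitrary; at a ~20% failure rate a
failure typically shows up within a handful of iterations):

for i in $(seq 1 20); do
        # same invocation as the lvm2 test harness above, one test only
        make check T=shell/lvconvert-repair-raid.sh || echo "round $i failed"
done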

lvconvert-repair-raid.sh creates a raid array, disables one of its
drives before there is enough time to finish the initial sync, and then
tries to repair it. The repair is supposed to fail (the test uses
dm-delay devices to slow down the sync).

[ 0:13.469] #lvconvert-repair-raid.sh:161+ lvcreate --type raid10 -m 1 -i 2 -L 64 -n LV1 LVMTEST191946vg
    /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv1
    /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv2
    /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv3
    /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv4
[ 0:13.469] Using default stripesize 64.00 KiB.
[ 0:13.483] Logical volume "LV1" created.
[ 0:14.042] 6,8908,1194343108,-;device-mapper: raid: Superblocks created for new raid set
[ 0:14.042] 5,8909,1194348704,-;md/raid10:mdX: not clean -- starting background reconstruction
[ 0:14.042] 6,8910,1194349443,-;md/raid10:mdX: active with 4 out of 4 devices
[ 0:14.042] 4,8911,1194459161,-;mdX: bitmap file is out of date, doing full recovery
[ 0:14.042] 6,8912,1194563810,-;md: resync of RAID array mdX
[ 0:14.042] WARNING: This metadata update is NOT backed up.
[ 0:14.042] aux disable_dev "$dev4"
[ 0:14.058] #lvconvert-repair-raid.sh:163+ aux disable_dev /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv4
[ 0:14.058] Disabling device /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv4 (253:5)
[ 0:14.101] not lvconvert -y --repair $vg/$lv1

When it fails, I see:

[ 0:13.831] #lvconvert-repair-raid.sh:161+ lvcreate --type raid10 -m 1 -i 2 -L 64 -n LV1 LVMTEST192248vg
    /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv1
    /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv2
    /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv3
    /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv4
[ 0:13.831] Using default stripesize 64.00 KiB.
[ 0:13.847] Logical volume "LV1" created.
[ 0:14.499] WARNING: This metadata update is NOT backed up.
[ 0:14.499] 6,8925,1187444256,-;device-mapper: raid: Superblocks created for new raid set
[ 0:14.499] 5,8926,1187449525,-;md/raid10:mdX: not clean -- starting background reconstruction
[ 0:14.499] 6,8927,1187450148,-;md/raid10:mdX: active with 4 out of 4 devices
[ 0:14.499] 6,8928,1187452472,-;md: resync of RAID array mdX
[ 0:14.499] 6,8929,1187453016,-;md: mdX: resync done.
[ 0:14.499] 4,8930,1187555486,-;mdX: bitmap file is out of date, doing full recovery
[ 0:14.499] aux disable_dev "$dev4"
[ 0:14.515] #lvconvert-repair-raid.sh:163+ aux disable_dev /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv4
[ 0:14.515] Disabling device /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv4 (253:5)
[ 0:14.554] not lvconvert -y --repair $vg/$lv1

To me the important-looking difference (and I admit, I'm no RAID
expert) is that in the case where the test passes (where lvconvert
fails as expected), I see:

[ 0:14.042] 4,8911,1194459161,-;mdX: bitmap file is out of date, doing full recovery
[ 0:14.042] 6,8912,1194563810,-;md: resync of RAID array mdX

When it fails, I see:

[ 0:14.499] 6,8928,1187452472,-;md: resync of RAID array mdX
[ 0:14.499] 6,8929,1187453016,-;md: mdX: resync done.
[ 0:14.499] 4,8930,1187555486,-;mdX: bitmap file is out of date, doing full recovery

This appears to show a resync that takes no time at all, presumably
because it completes before the device notices that the bitmap is out
of date and schedules a full recovery.
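
For what it's worth, the sync state is observable from userspace while
the test runs. A hypothetical watch loop (the dm device name here is
taken from the passing run's log and differs per run; the field
positions are per the dm-raid target's status line, where field 7 is
the in-sync ratio and field 8 the sync action):

# Poll the dm-raid status line to watch the in-sync ratio and the
# current sync action (idle/resync/recover/...) during the test.
while sleep 0.1; do
        dmsetup status LVMTEST191946vg-LV1 | awk '{print $7, $8}'
done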

lvconvert-raid-reshape-stripes-load-reload.sh repeatedly reloads the
device table during a raid reshape, and then checks the filesystem for
corruption afterwards. With this patchset the filesystem is
occasionally corrupted; I do not see this with the 6.6 kernel.
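
Schematically, the reload cycle the test exercises boils down to
something like this ("$dm" is a placeholder device name; the real test
drives this through lvm and its aux helpers, and the loop count is
arbitrary):

# Repeatedly push the same table into the inactive slot and swap it in
# while the reshape is running in the background.
dm=LVMTESTvg-LV1   # placeholder dm device name
for i in $(seq 1 100); do
        dmsetup table "$dm" | dmsetup reload "$dm"   # load table into inactive slot
        dmsetup suspend "$dm"                        # suspend ...
        dmsetup resume "$dm"                         # ... and resume to swap it in
done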