On Sun, Feb 04, 2024 at 09:35:09AM +0800, Yu Kuai wrote:
> Hi,
>
> On 2024/02/03 11:19, Benjamin Marzinski wrote:
> > On Thu, Feb 01, 2024 at 05:25:45PM +0800, Yu Kuai wrote:
> > > From: Yu Kuai <yukuai3@xxxxxxxxxx>
> > >
> > > I applied this patchset on top of v6.8-rc1 and ran the lvm2 test suite
> > > with the following cmd for 24 rounds (about 2 days):
> > >
> > > for t in `ls test/shell`; do
> > >         if cat test/shell/$t | grep raid &> /dev/null; then
> > >                 make check T=shell/$t
> > >         fi
> > > done
> > >
> > > failed count   failed test
> > >  1   ### failed: [ndev-vanilla] shell/dmsecuretest.sh
> > >  1   ### failed: [ndev-vanilla] shell/dmsetup-integrity-keys.sh
> > >  1   ### failed: [ndev-vanilla] shell/dmsetup-keyring.sh
> > >  5   ### failed: [ndev-vanilla] shell/duplicate-pvs-md0.sh
> > >  1   ### failed: [ndev-vanilla] shell/duplicate-vgid.sh
> > >  2   ### failed: [ndev-vanilla] shell/duplicate-vgnames.sh
> > >  1   ### failed: [ndev-vanilla] shell/fsadm-crypt.sh
> > >  1   ### failed: [ndev-vanilla] shell/integrity.sh
> > >  6   ### failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
> > >  2   ### failed: [ndev-vanilla] shell/lvchange-rebuild-raid.sh
> > >  5   ### failed: [ndev-vanilla] shell/lvconvert-raid-reshape-stripes-load-reload.sh
> > >  4   ### failed: [ndev-vanilla] shell/lvconvert-raid-restripe-linear.sh
> > >  1   ### failed: [ndev-vanilla] shell/lvconvert-raid1-split-trackchanges.sh
> > > 20   ### failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
> > > 20   ### failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
> > > 24   ### failed: [ndev-vanilla] shell/lvextend-raid.sh
> > >
> > > And I randomly picked some tests and verified by hand that they fail
> > > in v6.6 as well (not all tests):
> > >
> > > shell/lvextend-raid.sh
> > > shell/lvcreate-large-raid.sh
> > > shell/lvconvert-repair-raid.sh
> > > shell/lvchange-rebuild-raid.sh
> > > shell/lvchange-raid1-writemostly.sh
> >
> > In my testing with this patchset on top of the head of Linus's tree
> > (5c24e4e9e708) I am seeing failures in
> > shell/lvconvert-raid-reshape-stripes-load-reload.sh and
> > shell/lvconvert-repair-raid.sh in about 20% of my runs. I have never
> > seen either of these fail running on the 6.6 kernel (ffc253263a13).
>
> This sounds quite different from my testing. As I said, the test
> shell/lvconvert-repair-raid.sh is very likely to fail in v6.6 already;
> I don't know why it never fails in your testing. Test log in v6.6:
>
> | [ 1:38.162] #lvconvert-repair-raid.sh:1+ aux teardown
> | [ 1:38.162] ## teardown.......## removing stray mapped devices with names beginning with LVMTEST3474:
> | [ 1:39.207] .set +vx; STACKTRACE; set -vx
> | [ 1:41.448] ##lvconvert-repair-raid.sh:1+ set +vx
> | [ 1:41.448] ## - /mnt/test/lvm2/test/shell/lvconvert-repair-raid.sh:1
> | [ 1:41.449] ## 1 STACKTRACE() called from /mnt/test/lvm2/test/shell/lvconvert-repair-raid.sh:1
> | [ 1:41.449] ## ERROR: The test started dmeventd (3718) unexpectedly.
>
> And the same in v6.8-rc1. Do you perhaps know how to fix this error?

Could you run the test with something like

# make check_local T=lvconvert-repair-raid.sh VERBOSE=1 > out 2>&1

and post the output.

-Ben

> Thanks,
> Kuai
> >
> > lvconvert-repair-raid.sh creates a raid array and then disables one of
> > its drives before there's enough time to finish the initial sync and
> > tries to repair it. This is supposed to fail (it uses dm-delay devices
> > to slow down the sync).
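(For anyone not familiar with that part of the test suite's setup: a dm-delay
mapping of the kind the test relies on can be built roughly like this at the
dmsetup level. The backing device, mapping name, and delay value below are
purely illustrative, not taken from the test itself.)

    # Map /dev/loop0 through the dm-delay target, adding ~200ms to every I/O.
    echo "0 $(blockdev --getsz /dev/loop0) delay /dev/loop0 0 200" | \
            dmsetup create slow-pv
    # The slowed device then shows up as /dev/mapper/slow-pv; used as a PV,
    # it keeps the initial raid resync from finishing immediately.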
> > When the test succeeds, I see things like this:
> >
> > [ 0:13.469] #lvconvert-repair-raid.sh:161+ lvcreate --type raid10 -m 1 -i 2 -L 64 -n LV1 LVMTEST191946vg /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv1 /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv2 /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv3 /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv4
> > [ 0:13.469] Using default stripesize 64.00 KiB.
> > [ 0:13.483] Logical volume "LV1" created.
> > [ 0:14.042] 6,8908,1194343108,-;device-mapper: raid: Superblocks created for new raid set
> > [ 0:14.042] 5,8909,1194348704,-;md/raid10:mdX: not clean -- starting background reconstruction
> > [ 0:14.042] 6,8910,1194349443,-;md/raid10:mdX: active with 4 out of 4 devices
> > [ 0:14.042] 4,8911,1194459161,-;mdX: bitmap file is out of date, doing full recovery
> > [ 0:14.042] 6,8912,1194563810,-;md: resync of RAID array mdX
> > [ 0:14.042] WARNING: This metadata update is NOT backed up.
> > [ 0:14.042] aux disable_dev "$dev4"
> > [ 0:14.058] #lvconvert-repair-raid.sh:163+ aux disable_dev /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv4
> > [ 0:14.058] Disabling device /tmp/LVMTEST191946.ImUMG6dyqB/dev/mapper/LVMTEST191946pv4 (253:5)
> > [ 0:14.101] not lvconvert -y --repair $vg/$lv1
> >
> > When it fails, I see:
> >
> > [ 0:13.831] #lvconvert-repair-raid.sh:161+ lvcreate --type raid10 -m 1 -i 2 -L 64 -n LV1 LVMTEST192248vg /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv1 /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv2 /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv3 /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv4
> > [ 0:13.831] Using default stripesize 64.00 KiB.
> > [ 0:13.847] Logical volume "LV1" created.
> > [ 0:14.499] WARNING: This metadata update is NOT backed up.
> > [ 0:14.499] 6,8925,1187444256,-;device-mapper: raid: Superblocks created for new raid set
> > [ 0:14.499] 5,8926,1187449525,-;md/raid10:mdX: not clean -- starting background reconstruction
> > [ 0:14.499] 6,8927,1187450148,-;md/raid10:mdX: active with 4 out of 4 devices
> > [ 0:14.499] 6,8928,1187452472,-;md: resync of RAID array mdX
> > [ 0:14.499] 6,8929,1187453016,-;md: mdX: resync done.
> > [ 0:14.499] 4,8930,1187555486,-;mdX: bitmap file is out of date, doing full recovery
> > [ 0:14.499] aux disable_dev "$dev4"
> > [ 0:14.515] #lvconvert-repair-raid.sh:163+ aux disable_dev /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv4
> > [ 0:14.515] Disabling device /tmp/LVMTEST192248.ATcecgSGfE/dev/mapper/LVMTEST192248pv4 (253:5)
> > [ 0:14.554] not lvconvert -y --repair $vg/$lv1
> >
> > To me the important-looking difference (and I admit, I'm no RAID
> > expert) is that in the case where the test passes (where lvconvert
> > fails as expected), I see:
> >
> > [ 0:14.042] 4,8911,1194459161,-;mdX: bitmap file is out of date, doing full recovery
> > [ 0:14.042] 6,8912,1194563810,-;md: resync of RAID array mdX
> >
> > When it fails I see:
> >
> > [ 0:14.499] 6,8928,1187452472,-;md: resync of RAID array mdX
> > [ 0:14.499] 6,8929,1187453016,-;md: mdX: resync done.
> > [ 0:14.499] 4,8930,1187555486,-;mdX: bitmap file is out of date, doing full recovery
> >
> > This appears to show a resync that takes no time, presumably because it
> > happens before the device notices that the bitmaps are wrong and
> > schedules a full recovery.
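(If the VERBOSE output from the make check_local run suggested above was
captured to "out", one quick way to see which of the two orderings you got is
to pull out just those messages, e.g.:

    grep -nE 'md: resync of RAID array|md: mdX: resync done|bitmap file is out of date' out

In a failing run, per the log above, the "resync done" line shows up before
the "bitmap file is out of date" line.)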
> >
> > lvconvert-raid-reshape-stripes-load-reload.sh repeatedly reloads the
> > device table during a raid reshape, and then tests the filesystem for
> > corruption afterwards. With this patchset, the filesystem is
> > occasionally corrupted. I do not see this with the 6.6 kernel.
> >
> > -Ben
> >
> > > Xiao Ni also tested the last version on a real machine, see [1].
> > >
> > > [1] https://lore.kernel.org/all/CALTww29QO5kzmN6Vd+jT=-8W5F52tJjHKSgrfUc1Z1ZAeRKHHA@xxxxxxxxxxxxxx/
> > >
> > > Yu Kuai (14):
> > >   md: don't ignore suspended array in md_check_recovery()
> > >   md: don't ignore read-only array in md_check_recovery()
> > >   md: make sure md_do_sync() will set MD_RECOVERY_DONE
> > >   md: don't register sync_thread for reshape directly
> > >   md: don't suspend the array for interrupted reshape
> > >   md: fix missing release of 'active_io' for flush
> > >   md: export helpers to stop sync_thread
> > >   md: export helper md_is_rdwr()
> > >   dm-raid: really frozen sync_thread during suspend
> > >   md/dm-raid: don't call md_reap_sync_thread() directly
> > >   dm-raid: add a new helper prepare_suspend() in md_personality
> > >   md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape
> > >   dm-raid: fix lockdep waring in "pers->hot_add_disk"
> > >   dm-raid: remove mddev_suspend/resume()
> > >
> > >  drivers/md/dm-raid.c |  78 +++++++++++++++++++--------
> > >  drivers/md/md.c      | 126 +++++++++++++++++++++++++++++--------------
> > >  drivers/md/md.h      |  16 ++++++
> > >  drivers/md/raid10.c  |  16 +----
> > >  drivers/md/raid5.c   |  61 +++++++++++----------
> > >  5 files changed, 192 insertions(+), 105 deletions(-)
> > >
> > > --
> > > 2.39.2
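P.S. As a footnote on the reshape test mentioned above: a device-mapper table
reload cycle of the kind that test description refers to looks roughly like
the sketch below at the dmsetup level. The mapped-device name is hypothetical,
and the test itself drives this through lvm and its own helpers rather than
raw dmsetup; this is only meant to illustrate the suspend/load/resume pattern.

    dev=LVMTESTvg-LV1                       # hypothetical mapped device name
    table=$(dmsetup table "$dev")           # read the live raid table
    dmsetup suspend "$dev"                  # quiesce I/O while the reshape runs
    dmsetup load "$dev" --table "$table"    # load the same table into the inactive slot
    dmsetup resume "$dev"                   # swap it in and resume I/O
    fsck -n "/dev/mapper/$dev"              # afterwards, check the filesystem read-only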