On Mon, Mar 4, 2024 at 9:24 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2024/03/04 9:07, Yu Kuai wrote:
> > Hi,
> >
> > On 2024/03/03 21:16, Xiao Ni wrote:
> >> Hi all
> >>
> >> There is an error report from the lvm regression tests. The case is
> >> lvconvert-raid-reshape-stripes-load-reload.sh. I saw this error when I
> >> tried to fix the dmraid regression problems too. In my patch set, after
> >> reverting ad39c08186f8a0f221337985036ba86731d6aafe (md: Don't register
> >> sync_thread for reshape directly), this problem doesn't appear.
> >

Hi Kuai

> > How often did you see this test fail? I'm running the tests for over
> > two days now, for 30+ rounds, and this test never fails in my VM.

I ran it 5 times and it failed 2 times just now.

> Taking a quick look, there is still a path in raid10 through which
> MD_RECOVERY_FROZEN can be cleared, so in theory this problem can be
> triggered. Can you test the following patch on top of this set?
> I'll keep running the test myself.

Sure, I'll give the result later.

Regards
Xiao

> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a5f8419e2df1..7ca29469123a 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -4575,7 +4575,8 @@ static int raid10_start_reshape(struct mddev *mddev)
>  		return 0;
>
>  abort:
> -	mddev->recovery = 0;
> +	if (mddev->gendisk)
> +		mddev->recovery = 0;
>  	spin_lock_irq(&conf->device_lock);
>  	conf->geo = conf->prev;
>  	mddev->raid_disks = conf->geo.raid_disks;
>
> Thanks,
> Kuai
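For readers following along, the reason that one-line change matters:
mddev->recovery is a bitmask, so the wholesale "mddev->recovery = 0;"
in the abort path also clears MD_RECOVERY_FROZEN, the bit that patch 1
of this set keeps set for dm-raid until resume. Guarding on
mddev->gendisk presumably restricts the wholesale clear to native md
arrays, since a dm-raid mddev has no gendisk. A minimal sketch of the
idea (struct mddev and MD_RECOVERY_FROZEN are the real definitions
from drivers/md/md.h; the helper below is made up for illustration,
not something from the patch set):

#include "md.h"	/* struct mddev, MD_RECOVERY_FROZEN */

/*
 * Hypothetical helper, for illustration only: clear every recovery
 * flag except MD_RECOVERY_FROZEN, so an array that was frozen (e.g.
 * a suspended dm-raid device) stays frozen even when
 * raid10_start_reshape() has to abort.
 */
static inline void md_reset_recovery_keep_frozen(struct mddev *mddev)
{
	mddev->recovery &= 1UL << MD_RECOVERY_FROZEN;
}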
> >
> > Thanks,
> > Kuai
> >
> >>
> >> I put the log in the attachment.
> >>
> >> On Fri, Mar 1, 2024 at 6:03 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>>
> >>> From: Yu Kuai <yukuai3@xxxxxxxxxx>
> >>>
> >>> link to part1:
> >>> https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@xxxxxxxxxxxxxx/
> >>>
> >>> part1 contains fixes for deadlocks when stopping sync_thread.
> >>>
> >>> This set contains fixes for:
> >>> - reshape starting unexpectedly and causing data corruption, patches 1, 5, 6;
> >>> - deadlocks when reshape runs concurrently with IO, patch 8;
> >>> - a lockdep warning, patch 9.
> >>>
> >>> I'm running the lvm2 tests with the following script for a few rounds now:
> >>>
> >>> for t in `ls test/shell`; do
> >>>   if cat test/shell/$t | grep raid &> /dev/null; then
> >>>     make check T=shell/$t
> >>>   fi
> >>> done
> >>>
> >>> There are no deadlocks and no fs corruption now; however, there are
> >>> still four failed tests:
> >>>
> >>> ### failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
> >>> ### failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
> >>> ### failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
> >>> ### failed: [ndev-vanilla] shell/lvextend-raid.sh
> >>>
> >>> And the failure reason is the same for all of them:
> >>>
> >>> ## ERROR: The test started dmeventd (147856) unexpectedly
> >>>
> >>> I have no clue yet, and it seems other folks don't have this issue.
> >>>
> >>> Yu Kuai (9):
> >>>   md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
> >>>   md: export helpers to stop sync_thread
> >>>   md: export helper md_is_rdwr()
> >>>   md: add a new helper reshape_interrupted()
> >>>   dm-raid: really frozen sync_thread during suspend
> >>>   md/dm-raid: don't call md_reap_sync_thread() directly
> >>>   dm-raid: add a new helper prepare_suspend() in md_personality
> >>>   dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io
> >>>     concurrent with reshape
> >>>   dm-raid: fix lockdep warning in "pers->hot_add_disk"
> >>>
> >>>  drivers/md/dm-raid.c | 93 ++++++++++++++++++++++++++++++++++----------
> >>>  drivers/md/md.c      | 73 ++++++++++++++++++++++++++--------
> >>>  drivers/md/md.h      | 38 +++++++++++++++++-
> >>>  drivers/md/raid5.c   | 32 ++++++++++++++-
> >>>  4 files changed, 196 insertions(+), 40 deletions(-)
> >>>
> >>> --
> >>> 2.39.2
> >>>
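(A note for anyone trying to reproduce this: the failing case should
be runnable on its own with the same invocation the cover letter uses
for the whole suite, i.e. "make check
T=shell/lvconvert-raid-reshape-stripes-load-reload.sh" from an lvm2
source tree.)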