On Tue, Jan 30, 2024 at 6:41 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi, Blazej!
>
> On 2024/01/31 0:26, Blazej Kucman wrote:
> > Hi,
> >
> > On Fri, 26 Jan 2024 08:46:10 -0700
> > Dan Moulding <dan@xxxxxxxx> wrote:
> >>
> >> That's a good suggestion, so I switched it to use XFS. It can still
> >> reproduce the hang. Sounds like this is probably a different problem
> >> than the known ext4 one.
> >>
> >
> > Our daily tests targeting mdadm/md also detected a problem with
> > symptoms identical to those described in this thread.
> >
> > The issue was detected with IMSM metadata, but it also reproduces
> > with native metadata. NVMe disks under a VMD controller were used.
> >
> > Scenario:
> > 1. Create a raid10 array:
> >    mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> >    --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> >    --size=7864320 --run
> > 2. Create a filesystem:
> >    mkfs.ext4 /dev/md/r10d4s128-15_A
> > 3. Set one raid member faulty:
> >    mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> > 4. Stop raid devices:
> >    mdadm -Ss
> >
> > Expected result:
> > The raid stops without kernel hangs or errors.
> >
> > Actual result:
> > The command "mdadm -Ss" hangs and a hung_task warning occurs in the OS.
>
> Can you test the following patch?
>
> Thanks!
> Kuai
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index e3a56a958b47..a8db84c200fe 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -578,8 +578,12 @@ static void submit_flushes(struct work_struct *ws)
>  			rcu_read_lock();
>  		}
>  	rcu_read_unlock();
> -	if (atomic_dec_and_test(&mddev->flush_pending))
> +	if (atomic_dec_and_test(&mddev->flush_pending)) {
> +		/* The pair is percpu_ref_get() from md_flush_request() */
> +		percpu_ref_put(&mddev->active_io);
> +
>  		queue_work(md_wq, &mddev->flush_work);
> +	}
>  }
>
>  static void md_submit_flush_data(struct work_struct *ws)

This fixes the issue in my tests. Please submit the official patch.

Also, we should add a test in mdadm/tests to cover this case.

Thanks,
Song
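
A test for mdadm/tests along the lines Song suggests could look roughly like
the sketch below. It simply replays the reproduction steps quoted above; the
loop-device setup, file paths, array name, sizes, and the 60-second timeout
are illustrative assumptions, not part of the thread.

#!/bin/sh
# Rough sketch of a regression test for the "mdadm -Ss hangs after
# --set-faulty" scenario from this thread. Loop devices stand in for the
# NVMe disks of the original report; all names and sizes are illustrative.
set -e

devices=""
for i in 0 1 2 3; do
	truncate -s 512M "/tmp/md_flush_hang_$i.img"
	dev=$(losetup -f --show "/tmp/md_flush_hang_$i.img")
	devices="$devices $dev"
done

# 1. Create the raid10 array, mirroring the reported scenario.
mdadm --create /dev/md/r10test --level=10 --chunk=128 \
	--raid-devices=4 $devices --run

# 2. Create a filesystem on the array (ext4, as in the report).
mkfs.ext4 -F /dev/md/r10test

# 3. Fail one member ($dev still holds the last loop device created).
mdadm --set-faulty /dev/md/r10test "$dev"

# 4. Stop all arrays. Before the fix this hung with a hung_task splat in
#    the kernel log; run it under a timeout so the test fails rather than
#    hanging forever.
if ! timeout 60 mdadm -Ss; then
	echo "FAIL: mdadm -Ss hung or failed"
	exit 1
fi

# Best-effort cleanup of the loop devices and backing files.
for d in $devices; do
	losetup -d "$d"
done
rm -f /tmp/md_flush_hang_*.img

echo "PASS"

A real addition to mdadm/tests would presumably use the suite's own
device-setup and checking helpers rather than raw losetup calls as shown
here.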