On 5/31/22 12:35 AM, Logan Gunthorpe wrote:
> On 2022-05-30 03:55, Guoqing Jiang wrote:
>> I tried with 5.18.0-rc3: no problem for 07reshape5intr (I will investigate
>> why it failed with this patch), but 07revert-grow still failed even
>> without applying this one.
>> From fail07revert-grow.log, it shows the issues below.
>> [ 7856.233515] mdadm[25246]: segfault at 0 ip 000000000040fe56 sp
>> 00007ffdcf252800 error 4 in mdadm[400000+81000]
>> [ 7856.233544] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31
>> f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89
>> de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
>> [ 7866.871747] mdadm[25463]: segfault at 0 ip 000000000040fe56 sp
>> 00007ffe91e39800 error 4 in mdadm[400000+81000]
>> [ 7866.871760] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31
>> f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89
>> de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
>> [ 7876.779855] ======================================================
>> [ 7876.779858] WARNING: possible circular locking dependency detected
>> [ 7876.779861] 5.18.0-rc3-57-default #28 Tainted: G E
>> [ 7876.779864] ------------------------------------------------------
>> [ 7876.779867] mdadm/25444 is trying to acquire lock:
>> [ 7876.779870] ffff991817749938 ((wq_completion)md_misc){+.+.}-{0:0},
>> at: flush_workqueue+0x87/0x470
>> [ 7876.779879]
>> but task is already holding lock:
>> [ 7876.779882] ffff9917c5c1c2c0 (&mddev->reconfig_mutex){+.+.}-{3:3},
>> at: action_store+0x11a/0x2c0 [md_mod]
>> [ 7876.779892]
>> which lock already depends on the new lock.
> Hmm, strange. I'm definitely running with lockdep and even if I try the
> test on my machine, on v5.18-rc3, I don't get this error. Not sure why.
> In any case it looks like we recently added a
> flush_workqueue(md_misc_wq) call in action_store() which runs with the
> mddev_lock() held. According to your lockdep warning, that can deadlock.
It was originally added by commit f851b60db, if I am not mistaken.
> That call was added in this commit:
> Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev")
The above Fixes: commit didn't add the flush call itself. cc1ffe61c026 was
introduced to avoid other lockdep warnings; IIRC it just added a
work_pending() check before the flush.
> Can we maybe run flush_workqueue() before we take mddev_lock()?
I am not sure yet; I need to investigate and test it. Anyway, it is on my
todo list unless someone beats me to it 😉.
Thanks,
Guoqing