Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held

On 2022-05-30 03:55, Guoqing Jiang wrote:
> I tried with 5.18.0-rc3, no problem for 07reshape5intr (will investigate
> why it failed with this patch), but 07revert-grow still failed without
> applying this one.
> 
> From fail07revert-grow.log, it shows the issues below:
> 
> [ 7856.233515] mdadm[25246]: segfault at 0 ip 000000000040fe56 sp 
> 00007ffdcf252800 error 4 in mdadm[400000+81000]
> [ 7856.233544] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 
> f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 
> de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
> 
> [ 7866.871747] mdadm[25463]: segfault at 0 ip 000000000040fe56 sp 
> 00007ffe91e39800 error 4 in mdadm[400000+81000]
> [ 7866.871760] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31 
> f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89 
> de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
> 
> [ 7876.779855] ======================================================
> [ 7876.779858] WARNING: possible circular locking dependency detected
> [ 7876.779861] 5.18.0-rc3-57-default #28 Tainted: G            E
> [ 7876.779864] ------------------------------------------------------
> [ 7876.779867] mdadm/25444 is trying to acquire lock:
> [ 7876.779870] ffff991817749938 ((wq_completion)md_misc){+.+.}-{0:0}, 
> at: flush_workqueue+0x87/0x470
> [ 7876.779879]
>                 but task is already holding lock:
> [ 7876.779882] ffff9917c5c1c2c0 (&mddev->reconfig_mutex){+.+.}-{3:3}, 
> at: action_store+0x11a/0x2c0 [md_mod]
> [ 7876.779892]
>                 which lock already depends on the new lock.
> 

Hmm, strange. I'm definitely running with lockdep, and even when I run the
test on my machine on v5.18-rc3, I don't get this error. Not sure why.

In any case, it looks like we recently added a flush_workqueue(md_misc_wq)
call in action_store(), which runs with mddev_lock() held. According to
your lockdep warning, that can deadlock.
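For clarity, this is the shape of the inversion as I read the splat (the
exact work item that closes the cycle may differ in your tree; this is
just the generic pattern, not verified code):

    /*
     * CPU0 (mdadm writing sync_action)     CPU1 (md_misc_wq worker)
     *
     * action_store()
     *   mddev_lock()
     *   // holds reconfig_mutex            work item runs and needs
     *   flush_workqueue(md_misc_wq)        reconfig_mutex to finish
     *   // waits for CPU1's work item      // blocks on CPU0's mutex
     *
     * Neither side can make progress: the flush never returns and
     * reconfig_mutex is never released.
     */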

That call was added in this commit:

Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev")

Can we maybe run flush_workqueue() before we take mddev_lock()?
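Something like the below, completely untested and only meant to illustrate
the ordering I have in mind (the surrounding "idle"/"frozen" handling in
action_store() is elided):

    /*
     * Sketch only: flush md_misc_wq before reconfig_mutex is taken, so
     * that flush_workqueue() is never nested inside mddev_lock() and the
     * md_misc <-> reconfig_mutex dependency lockdep complains about
     * goes away.
     */
    flush_workqueue(md_misc_wq);

    err = mddev_lock(mddev);
    if (err)
        return err;

    /* ... existing sync_action handling under reconfig_mutex ... */

    mddev_unlock(mddev);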

Logan


