On 5/31/22 12:35 AM, Logan Gunthorpe wrote:
> On 2022-05-30 03:55, Guoqing Jiang wrote:
>> I tried with 5.18.0-rc3: no problem for 07reshape5intr (I will investigate
>> why it failed with this patch), but 07revert-grow still failed even
>> without applying this one.
>> From fail07revert-grow.log, it shows the issues below.
>> [ 7856.233515] mdadm[25246]: segfault at 0 ip 000000000040fe56 sp
>> 00007ffdcf252800 error 4 in mdadm[400000+81000]
>> [ 7856.233544] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31
>> f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89
>> de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
>> [ 7866.871747] mdadm[25463]: segfault at 0 ip 000000000040fe56 sp
>> 00007ffe91e39800 error 4 in mdadm[400000+81000]
>> [ 7866.871760] Code: 00 48 8d 7c 24 30 e8 79 30 ff ff 48 8d 7c 24 30 31
>> f6 31 c0 e8 db 34 ff ff 85 c0 79 77 bf 26 50 46 00 b9 04 00 00 00 48 89
>> de <f3> a6 0f 97 c0 1c 00 84 c0 75 18 e8 fa 36 ff ff 48 0f be 53 04 48
>> [ 7876.779855] ======================================================
>> [ 7876.779858] WARNING: possible circular locking dependency detected
>> [ 7876.779861] 5.18.0-rc3-57-default #28 Tainted: G E
>> [ 7876.779864] ------------------------------------------------------
>> [ 7876.779867] mdadm/25444 is trying to acquire lock:
>> [ 7876.779870] ffff991817749938 ((wq_completion)md_misc){+.+.}-{0:0},
>> at: flush_workqueue+0x87/0x470
>> [ 7876.779879]
>> but task is already holding lock:
>> [ 7876.779882] ffff9917c5c1c2c0 (&mddev->reconfig_mutex){+.+.}-{3:3},
>> at: action_store+0x11a/0x2c0 [md_mod]
>> [ 7876.779892]
>> which lock already depends on the new lock.
> Hmm, strange. I'm definitely running with lockdep and even if I try the
> test on my machine, on v5.18-rc3, I don't get this error. Not sure why.
> In any case it looks like we recently added a
> flush_workqueue(md_misc_wq) call in action_store() which runs with the
> mddev_lock() held. According to your lockdep warning, that can deadlock.
It was originally added by commit f851b60db, if I am not mistaken.
> That call was added in this commit:
> Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev")
The above Fixes: commit didn't add the flush call itself. cc1ffe61c026 was
introduced to avoid other lockdep warnings; IIRC it just added a
work_pending() check before the flush.
> Can we maybe run flush_workqueue() before we take mddev_lock()?
I am not sure yet; I need to investigate and test it. Anyway, it is on my
todo list unless someone beats me to it 😉.
Thanks,
Guoqing