Hi,
On 2023/09/24 22:35, Donald Buczek wrote:
On 9/17/23 10:55, Donald Buczek wrote:
On 9/14/23 08:03, Donald Buczek wrote:
On 9/13/23 16:16, Dragan Stancevic wrote:
Hi Donald-
[...]
Here is a list of changes for 6.1:
e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
f71209b1f21c md: enhance checking in md_check_recovery()
753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"
Thanks!
I've put these patches on v6.1.52. I've started a script which
transitions the three md-devices of a very active backup server
through idle->check->idle every 6 minutes a few hours ago. It went
through ~400 iterations till now. No lock-ups so far.
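
A loop of the kind described above could look roughly like the following. This is only a minimal sketch, not the script that was actually run: the function names, the `sysfs_root` and `sleep` parameters, and the device names in the usage note are assumptions made for illustration. Only the `sync_action` sysfs file and the values `idle` and `check` come from this thread.

```python
import time
from pathlib import Path


def set_sync_action(md_name, action, sysfs_root="/sys/devices/virtual/block"):
    """Write an action such as 'idle' or 'check' to <md_name>/md/sync_action."""
    path = Path(sysfs_root) / md_name / "md" / "sync_action"
    path.write_text(action + "\n")
    return path


def cycle_sync_action(md_names, iterations, period_seconds=360,
                      sysfs_root="/sys/devices/virtual/block",
                      sleep=time.sleep):
    """Drive each device through idle -> check, wait, and repeat.

    A final 'idle' is written after the last iteration so that every
    device ends in the idle state. Returns the number of iterations run.
    """
    done = 0
    for _ in range(iterations):
        for name in md_names:
            set_sync_action(name, "idle", sysfs_root)   # stop any running check
            set_sync_action(name, "check", sysfs_root)  # start a new check
        sleep(period_seconds)
        done += 1
    for name in md_names:
        set_sync_action(name, "idle", sysfs_root)
    return done
```

Run as root with, for example, `cycle_sync_action(["md0", "md1", "md2"], iterations=400)` (hypothetical device names). If the array is deadlocked, the write of `idle` is expected to block, which is the symptom discussed in this thread.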
Oh dear, looks like the deadlock problem is _not_ fixed with these
patches.
Some more info after another incident:
- We've hit the deadlock with 5.15.131 (so it is NOT introduced by any
of the above patches)
- The symptoms are not exactly the same as with the original year-old
problem. Differences:
  - mdX_raid6 is NOT busy looping
  - /sys/devices/virtual/block/mdX/md/array_state says "active" not
    "write-pending"
  - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does
    not resolve the deadlock
  - After hours in the deadlock state the system resumed operation when
    a script of mine read(!) lots of sysfs files.
- But in both cases, `echo idle >
/sys/devices/virtual/block/mdX/md/sync_action` hangs, as does all I/O
on the raid.
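
To compare the two cases, the sysfs attributes mentioned above can be captured in one go. The helper below is a hedged sketch: `md_snapshot` and its `sysfs_root` parameter are hypothetical names introduced here, and it only reads the `array_state` and `sync_action` files named in this mail.

```python
from pathlib import Path


def md_snapshot(md_name, sysfs_root="/sys/devices/virtual/block"):
    """Return the current array_state and sync_action of one md device."""
    md_dir = Path(sysfs_root) / md_name / "md"
    return {
        attr: (md_dir / attr).read_text().strip()
        for attr in ("array_state", "sync_action")
    }
```

Calling `md_snapshot("mdX")` with the real device name before and during an incident gives a small record that can be attached to a report.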
The fact that we didn't hit the problem for many months on 5.15.94 might
hint that it was introduced between 5.15.94 and 5.15.131.
We'll try to reproduce the problem on a test machine for analysis, but
this may take time (vacation imminent for one...).
But it's not like these patches caused the problem. And maybe they _did_
fix the original problem, as we didn't hit that one.
Sorry for the late reply. Yes, this looks like a different problem. I'm
pretty confident that the original problem is fixed, since echo
idle/frozen no longer holds the lock 'reconfig_mutex' while waiting for
sync_thread to be done.
I'll check patches between 5.15.94 and 5.15.131.
Thanks,
Kuai
Best
Donald