Hi,
在 2023/03/22 22:32, Guoqing Jiang 写道:
Could you explain how the same work can be re-queued? Isn't the
PENDING_BIT
is already set in t3? I believe queue_work shouldn't do that per the
comment
but I am not expert ...
This is not related to workqueue, it is just because raid10
reinitialize the work that is already queued,
I am trying to understand the possibility.
like I discribed later in t3:
t2:
md_check_recovery:
INIT_WORK -> clear pending
queue_work -> set pending
list_add_tail
...
t3: -> work is still pending
md_check_recovery:
INIT_WORK -> clear pending
queue_work -> set pending
list_add_tail -> list is corrupted
First, t2 and t3 can't be run in parallel since reconfig_mutex must be
held. And if sync_thread existed,
the second process would unregister and reap sync_thread which means the
second process will
call INIT_WORK and queue_work again.
Maybe your description is valid, I would prefer call work_pending and
flush_workqueue instead of
INIT_WORK and queue_work.
This is not enough, it's right this can avoid list corruption, but the
worker function md_start_sync just register a sync_thread, and
md_do_sync() can still in progress, hence this can't prevent a new
sync_thread to start while the old one is not done, some other problems
like deadlock can still be triggered.
Of course, our 5.10 and mainline are the same,
there are some tests:
First the deadlock can be reporduced reliably, test script is simple:
mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]
So this is raid10 while the previous problem was appeared in raid456, I
am not sure it is the same
issue, but let's see.
Ok, I'm not quite familiar with raid456 yet, however, the problem is
still related to that action_store hold mutex to unregister sync_thread,
right?
Then, the problem MD_RECOVERY_RUNNING can be cleared can't be reporduced
reliably, usually it takes 2+ days to triggered a problem, and each time
problem phenomenon can be different, I'm hacking the kernel and add
some BUG_ON to test MD_RECOVERY_RUNNING in attached patch, following
test can trigger the BUG_ON:
Also your debug patch obviously added large delay which make the
calltrace happen, I doubt
if user can hit it in real life. Anyway, will try below test from my side.
mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
sleep 5
echo 1 > /sys/module/md_mod/parameters/set_delay
echo idle > /sys/block/md0/md/sync_action &
sleep 5
echo "want_replacement" > /sys/block/md0/md/dev-sdd/state
test result:
[ 228.390237] md_check_recovery: running is set
[ 228.391376] md_check_recovery: queue new sync thread
[ 233.671041] action_store unregister success! delay 10s
[ 233.689276] md_check_recovery: running is set
[ 238.722448] md_check_recovery: running is set
[ 238.723328] md_check_recovery: queue new sync thread
[ 238.724851] md_do_sync: before new wor, sleep 10s
[ 239.725818] md_do_sync: delay done
[ 243.674828] action_store delay done
[ 243.700102] md_reap_sync_thread: running is cleared!
[ 243.748703] ------------[ cut here ]------------
[ 243.749656] kernel BUG at drivers/md/md.c:9084!
After your debug patch applied, is L9084 points to below?
9084 mddev->curr_resync = MaxSector;
In my environment, it's a BUG_ON() that I added in md_do_sync:
9080 skip:
9081 /* set CHANGE_PENDING here since maybe another update is
needed,
9082 ┊* so other nodes are informed. It should be harmless for
normal
9083 ┊* raid */
9084 BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
9085 set_mask_bits(&mddev->sb_flags, 0,
9086 ┊ BIT(MD_SB_CHANGE_PENDING) |
BIT(MD_SB_CHANGE_DEVS));
I don't understand how it triggers below calltrace, and it has nothing
to do with
list corruption, right?
Yes, this is just a early BUG_ON() to detect that if MD_RECOVERY_RUNNING
is cleared while sync_thread is still in progress.
Thanks,
Kuai