On 3/23/23 09:36, Yu Kuai wrote:
Hi,
在 2023/03/22 22:32, Guoqing Jiang 写道:
Could you explain how the same work can be re-queued? Isn't the
PENDING_BIT
is already set in t3? I believe queue_work shouldn't do that per
the comment
but I am not expert ...
This is not related to workqueue, it is just because raid10
reinitialize the work that is already queued,
I am trying to understand the possibility.
like I discribed later in t3:
t2:
md_check_recovery:
INIT_WORK -> clear pending
queue_work -> set pending
list_add_tail
...
t3: -> work is still pending
md_check_recovery:
INIT_WORK -> clear pending
queue_work -> set pending
list_add_tail -> list is corrupted
First, t2 and t3 can't be run in parallel since reconfig_mutex must
be held. And if sync_thread existed,
the second process would unregister and reap sync_thread which means
the second process will
call INIT_WORK and queue_work again.
Maybe your description is valid, I would prefer call work_pending and
flush_workqueue instead of
INIT_WORK and queue_work.
This is not enough, it's right this can avoid list corruption, but the
worker function md_start_sync just register a sync_thread, and
md_do_sync() can still in progress, hence this can't prevent a new
sync_thread to start while the old one is not done, some other problems
like deadlock can still be triggered.
Of course, our 5.10 and mainline are the same,
there are some tests:
First the deadlock can be reporduced reliably, test script is simple:
mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]
So this is raid10 while the previous problem was appeared in raid456,
I am not sure it is the same
issue, but let's see.
Ok, I'm not quite familiar with raid456 yet, however, the problem is
still related to that action_store hold mutex to unregister sync_thread,
right?
Yes and no, the previous raid456 bug also existed because it can't get
stripe while
barrier is involved as you mentioned in patch 4, which is different.
Then, the problem MD_RECOVERY_RUNNING can be cleared can't be
reporduced
reliably, usually it takes 2+ days to triggered a problem, and each
time
problem phenomenon can be different, I'm hacking the kernel and add
some BUG_ON to test MD_RECOVERY_RUNNING in attached patch, following
test can trigger the BUG_ON:
Also your debug patch obviously added large delay which make the
calltrace happen, I doubt
if user can hit it in real life. Anyway, will try below test from my
side.
mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
sleep 5
echo 1 > /sys/module/md_mod/parameters/set_delay
echo idle > /sys/block/md0/md/sync_action &
sleep 5
echo "want_replacement" > /sys/block/md0/md/dev-sdd/state
Combined your debug patch with above steps. Seems you are
1. add delay to action_store, so it can't get lock in time.
2. echo "want_replacement"**triggers md_check_recovery which can grab lock
to start sync thread.
3. action_store finally hold lock to clear RECOVERY_RUNNING in reap sync
thread.
4. Then the new added BUG_ON is invoked since RECOVERY_RUNNING is cleared
in step 3.
test result:
[ 228.390237] md_check_recovery: running is set
[ 228.391376] md_check_recovery: queue new sync thread
[ 233.671041] action_store unregister success! delay 10s
[ 233.689276] md_check_recovery: running is set
[ 238.722448] md_check_recovery: running is set
[ 238.723328] md_check_recovery: queue new sync thread
[ 238.724851] md_do_sync: before new wor, sleep 10s
[ 239.725818] md_do_sync: delay done
[ 243.674828] action_store delay done
[ 243.700102] md_reap_sync_thread: running is cleared!
[ 243.748703] ------------[ cut here ]------------
[ 243.749656] kernel BUG at drivers/md/md.c:9084!
After your debug patch applied, is L9084 points to below?
9084 mddev->curr_resync = MaxSector;
In my environment, it's a BUG_ON() that I added in md_do_sync:
Ok, so we are on different code base ...
9080 skip:
9081 /* set CHANGE_PENDING here since maybe another update is
needed,
9082 ┊* so other nodes are informed. It should be harmless for
normal
9083 ┊* raid */
9084 BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
9085 set_mask_bits(&mddev->sb_flags, 0,
9086 ┊ BIT(MD_SB_CHANGE_PENDING) |
BIT(MD_SB_CHANGE_DEVS));
I don't understand how it triggers below calltrace, and it has
nothing to do with
list corruption, right?
Yes, this is just a early BUG_ON() to detect that if MD_RECOVERY_RUNNING
is cleared while sync_thread is still in progress.
sync_thread can be interrupted once MD_RECOVERY_INTR is set which means
the RUNNING
can be cleared, so I am not sure the added BUG_ON is reasonable. And
change BUG_ON
like this makes more sense to me.
+BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+!test_bit(MD_RECOVERY_INTR, &mddev->recovery));
I think there might be racy window like you described but it should be
really small, I prefer
to just add a few lines like this instead of revert and introduce new
lock to resolve the same
issue (if it is).
@@ -4792,9 +4793,15 @@action_store(struct mddev *mddev, const char
*page, size_t len)
if (mddev->sync_thread) {
sector_t save_rp = mddev->reshape_position;
+set_bit(MD_RECOVERY_DONOT, &mddev->recovery);
@@ -4805,6 +4812,7 @@action_store(struct mddev *mddev, const char *page,
size_t len)
mddev->reshape_position = save_rp;
set_bit(MD_RECOVERY_INTR,
&mddev->recovery);
md_reap_sync_thread(mddev);
+clear_bit(MD_RECOVERY_DONOT, &mddev->recovery);
}
mddev_unlock(mddev);
@@ -9296,6 +9313,9 @@void md_check_recovery(struct mddev *mddev)
if (!md_is_rdwr(mddev) &&
!test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
return;
+/* action_store is in the middle of reap sync thread, let's wait */
+if (test_bit(MD_RECOVERY_DONOT, &mddev->recovery))
+return;
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -553,6 +553,7 @@enum recovery_flags {
MD_RECOVERY_ERROR, /* sync-action interrupted because
io-error */
MD_RECOVERY_WAIT, /* waiting for pers->start() to finish */
MD_RESYNCING_REMOTE, /* remote node is running resync thread */
+MD_RECOVERY_DONOT, /* for a nasty racy issue */
};
TBH, I am reluctant to see the changes in the series, it can only be
considered
acceptable with conditions:
1. the previous raid456 bug can be fixed in this way too, hopefully Marc
or others
can verify it.
2. pass all the tests in mdadm.
Thanks,
Guoqing