Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

On 3/23/23 09:36, Yu Kuai wrote:
Hi,

On 2023/03/22 22:32, Guoqing Jiang wrote:
Could you explain how the same work can be re-queued? Isn't the PENDING_BIT already set in t3? I believe queue_work shouldn't do that per the comment,
but I am not an expert ...

This is not related to the workqueue itself, it is just because raid10
reinitializes a work that is already queued,

I am trying to understand the possibility.

like I described later in t3:

t2:
md_check_recovery:
 INIT_WORK -> clear pending
 queue_work -> set pending
  list_add_tail
...

t3: -> work is still pending
md_check_recovery:
 INIT_WORK -> clear pending
 queue_work -> set pending
  list_add_tail -> list is corrupted
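
To make this concrete, below is a minimal, self-contained sketch of the same
pattern (the names are made up for illustration, this is not the raid10/md code):

/*
 * Sketch only: a work_struct that is still pending is re-initialized and
 * queued again. INIT_WORK() clears the pending bit and resets work->entry,
 * so the second queue_work() does list_add_tail() on an entry that is still
 * linked into the workqueue's list -> list corruption.
 */
#include <linux/workqueue.h>

static void example_handler(struct work_struct *work)
{
        /* stands in for md_start_sync() */
}

static struct work_struct example_work;

static void requeue_like_t2_t3(void)
{
        /* t2 */
        INIT_WORK(&example_work, example_handler);      /* clears pending, resets entry */
        queue_work(system_wq, &example_work);           /* sets pending, list_add_tail() */

        /* t3: example_handler() has not run yet, the work is still on the list */
        INIT_WORK(&example_work, example_handler);      /* pending cleared again */
        queue_work(system_wq, &example_work);           /* list_add_tail() on a linked entry */
}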

First, t2 and t3 can't run in parallel since reconfig_mutex must be held. And if a sync_thread existed, the second process would unregister and reap it, which means the second process will
call INIT_WORK and queue_work again.

Maybe your description is valid; I would prefer to call work_pending and flush_workqueue instead of
INIT_WORK and queue_work.
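
One way to do that (just a rough, self-contained sketch with made-up names, not
an actual patch against md.c) would be to flush the queue when the work is still
pending, so the queued instance has run before it is queued again:

#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;     /* hypothetical, stands in for md_misc_wq */
static struct work_struct example_work;

static void example_handler(struct work_struct *work)
{
        /* stands in for md_start_sync() */
}

static void queue_work_once(void)
{
        if (work_pending(&example_work))
                flush_workqueue(example_wq);    /* let the queued instance run first */

        INIT_WORK(&example_work, example_handler);
        queue_work(example_wq, &example_work);
}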

This is not enough. It's right that this can avoid the list corruption, but the
worker function md_start_sync just registers a sync_thread, and
md_do_sync() can still be in progress, hence this can't prevent a new
sync_thread from starting while the old one is not done, and some other problems
like deadlock can still be triggered.
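
In other words (a sketch of my understanding, again with made-up names): the work
handler only starts a kernel thread and returns, so flush_workqueue() returning
says nothing about whether the long-running sync itself has finished:

#include <linux/kthread.h>
#include <linux/workqueue.h>

static struct task_struct *example_sync_task;   /* stands in for the md sync_thread */

static int example_do_sync(void *data)          /* stands in for md_do_sync() */
{
        /* ... long-running resync/recovery loop ... */
        return 0;
}

static void example_start_sync(struct work_struct *work)  /* stands in for md_start_sync() */
{
        example_sync_task = kthread_run(example_do_sync, NULL, "example_resync");
        /*
         * The handler returns here, so a flush_workqueue() completes even
         * though example_do_sync() may still run for a long time.
         */
}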

Of course, our 5.10 and mainline are the same in this regard.

Here are some tests:

First, the deadlock can be reproduced reliably; the test script is simple:

mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]

So this is raid10, while the previous problem appeared in raid456; I am not sure it is the same
issue, but let's see.

Ok, I'm not quite familiar with raid456 yet; however, the problem is
still related to the fact that action_store holds the mutex to unregister sync_thread,
right?

Yes and no, the previous raid456 bug also existed because it can't get a stripe while
the barrier is involved, as you mentioned in patch 4, which is different.


Then, the problem that MD_RECOVERY_RUNNING can be cleared can't be reproduced reliably; usually it takes 2+ days to trigger a problem, and each time
the symptom can be different. I'm hacking the kernel and adding
some BUG_ON()s to test MD_RECOVERY_RUNNING in the attached patch; the following
test can trigger the BUG_ON:

Also, your debug patch obviously added a large delay which makes the calltrace happen; I doubt a user can hit it in real life. Anyway, I will try the below test from my side.

mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
sleep 5
echo 1 > /sys/module/md_mod/parameters/set_delay
echo idle > /sys/block/md0/md/sync_action &
sleep 5
echo "want_replacement" > /sys/block/md0/md/dev-sdd/state

Combining your debug patch with the above steps, it seems the sequence is:

1. a delay is added to action_store, so it can't get the lock in time.
2. echo "want_replacement" triggers md_check_recovery, which can grab the lock
    to start the sync thread.
3. action_store finally holds the lock and clears RECOVERY_RUNNING in reap sync thread.
4. Then the newly added BUG_ON is invoked, since RECOVERY_RUNNING was cleared
    in step 3.


test result:

[  228.390237] md_check_recovery: running is set
[  228.391376] md_check_recovery: queue new sync thread
[  233.671041] action_store unregister success! delay 10s
[  233.689276] md_check_recovery: running is set
[  238.722448] md_check_recovery: running is set
[  238.723328] md_check_recovery: queue new sync thread
[  238.724851] md_do_sync: before new wor, sleep 10s
[  239.725818] md_do_sync: delay done
[  243.674828] action_store delay done
[  243.700102] md_reap_sync_thread: running is cleared!
[  243.748703] ------------[ cut here ]------------
[  243.749656] kernel BUG at drivers/md/md.c:9084!

After your debug patch is applied, does L9084 point to the below?

9084                                 mddev->curr_resync = MaxSector;

In my environment, it's a BUG_ON() that I added in md_do_sync:

Ok, so we are on different code bases ...

9080  skip:
9081         /* set CHANGE_PENDING here since maybe another update is needed,
9082          * so other nodes are informed. It should be harmless for normal
9083          * raid */
9084         BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
9085         set_mask_bits(&mddev->sb_flags, 0,
9086                       BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));


I don't understand how it triggers the below calltrace, and it has nothing to do with the
list corruption, right?

Yes, this is just an early BUG_ON() to detect whether MD_RECOVERY_RUNNING
is cleared while the sync_thread is still in progress.

The sync_thread can be interrupted once MD_RECOVERY_INTR is set, which means RUNNING can be cleared, so I am not sure the added BUG_ON is reasonable. Changing the BUG_ON
like this makes more sense to me:

+       BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+              !test_bit(MD_RECOVERY_INTR, &mddev->recovery));

I think there might be a racy window like you described, but it should be really small. I prefer to just add a few lines like this instead of reverting and introducing a new lock to resolve the same
issue (if it is the same issue).

@@ -4792,9 +4793,15 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                       if (mddev->sync_thread) {
                               sector_t save_rp = mddev->reshape_position;

+                              set_bit(MD_RECOVERY_DONOT, &mddev->recovery);
@@ -4805,6 +4812,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                               mddev->reshape_position = save_rp;
                               set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                               md_reap_sync_thread(mddev);
+                              clear_bit(MD_RECOVERY_DONOT, &mddev->recovery);
                       }
                       mddev_unlock(mddev);
@@ -9296,6 +9313,9 @@ void md_check_recovery(struct mddev *mddev)
       if (!md_is_rdwr(mddev) &&
           !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
               return;
+      /* action_store is in the middle of reap sync thread, let's wait */
+      if (test_bit(MD_RECOVERY_DONOT, &mddev->recovery))
+              return;
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -553,6 +553,7 @@ enum recovery_flags {
       MD_RECOVERY_ERROR,      /* sync-action interrupted because io-error */
       MD_RECOVERY_WAIT,       /* waiting for pers->start() to finish */
       MD_RESYNCING_REMOTE,    /* remote node is running resync thread */
+      MD_RECOVERY_DONOT,      /* for a nasty racy issue */
 };

TBH, I am reluctant to see the changes in this series; they can only be considered
acceptable under the following conditions:

1. the previous raid456 bug can be fixed in this way too; hopefully Marc or others
    can verify it.
2. it passes all the tests in mdadm.

Thanks,
Guoqing


