Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

On 3/23/23 09:36, Yu Kuai wrote:
Hi,

On 2023/03/22 22:32, Guoqing Jiang wrote:
Could you explain how the same work can be re-queued? Isn't the PENDING_BIT already set in t3? I believe queue_work shouldn't do that per the comment,
but I am not an expert ...

This is not related to the workqueue itself, it is just because raid10
reinitializes a work that is already queued,

I am trying to understand the possibility.

like I described later in t3:

t2:
md_check_recovery:
 INIT_WORK -> clear pending
 queue_work -> set pending
  list_add_tail
...

t3: -> work is still pending
md_check_recovery:
 INIT_WORK -> clear pending
 queue_work -> set pending
  list_add_tail -> list is corrupted
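
To make this concrete, below is a minimal, self-contained sketch of the same
pattern (the names are made up for illustration, this is not the raid10/md code):

/*
 * Sketch only: a work_struct that is still pending is re-initialized and
 * queued again. INIT_WORK() clears the pending bit and resets work->entry,
 * so the second queue_work() does list_add_tail() on an entry that is still
 * linked into the workqueue's list -> list corruption.
 */
#include <linux/workqueue.h>

static void example_handler(struct work_struct *work)
{
        /* stands in for md_start_sync() */
}

static struct work_struct example_work;

static void requeue_like_t2_t3(void)
{
        /* t2 */
        INIT_WORK(&example_work, example_handler);      /* clears pending, resets entry */
        queue_work(system_wq, &example_work);           /* sets pending, list_add_tail() */

        /* t3: example_handler() has not run yet, the work is still on the list */
        INIT_WORK(&example_work, example_handler);      /* pending cleared again */
        queue_work(system_wq, &example_work);           /* list_add_tail() on a linked entry */
}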

First, t2 and t3 can't run in parallel since reconfig_mutex must be held. And if a sync_thread existed, the second process would unregister and reap it, which means the second process will
call INIT_WORK and queue_work again.

Maybe your description is valid; I would prefer to call work_pending and flush_workqueue instead of
INIT_WORK and queue_work.
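
One way to do that (just a rough, self-contained sketch with made-up names, not
an actual patch against md.c) would be to flush the queue when the work is still
pending, so the queued instance has run before it is queued again:

#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;     /* hypothetical, stands in for md_misc_wq */
static struct work_struct example_work;

static void example_handler(struct work_struct *work)
{
        /* stands in for md_start_sync() */
}

static void queue_work_once(void)
{
        if (work_pending(&example_work))
                flush_workqueue(example_wq);    /* let the queued instance run first */

        INIT_WORK(&example_work, example_handler);
        queue_work(example_wq, &example_work);
}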

This is not enough. It's right that this can avoid the list corruption, but the
worker function md_start_sync just registers a sync_thread, and
md_do_sync() can still be in progress, hence this can't prevent a new
sync_thread from starting while the old one is not done, and some other problems
like deadlock can still be triggered.
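
In other words (a sketch of my understanding, again with made-up names): the work
handler only starts a kernel thread and returns, so flush_workqueue() returning
says nothing about whether the long-running sync itself has finished:

#include <linux/kthread.h>
#include <linux/workqueue.h>

static struct task_struct *example_sync_task;   /* stands in for the md sync_thread */

static int example_do_sync(void *data)          /* stands in for md_do_sync() */
{
        /* ... long-running resync/recovery loop ... */
        return 0;
}

static void example_start_sync(struct work_struct *work)  /* stands in for md_start_sync() */
{
        example_sync_task = kthread_run(example_do_sync, NULL, "example_resync");
        /*
         * The handler returns here, so a flush_workqueue() completes even
         * though example_do_sync() may still run for a long time.
         */
}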

Of course, our 5.10 and mainline are the same in this regard.

Here are some tests:

First, the deadlock can be reproduced reliably; the test script is simple:

mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]

So this is raid10, while the previous problem appeared in raid456; I am not sure it is the same
issue, but let's see.

Ok, I'm not quite familiar with raid456 yet; however, the problem is
still related to the fact that action_store holds the mutex to unregister sync_thread,
right?

Yes and no, the previous raid456 bug also existed because it can't get a stripe while
the barrier is involved, as you mentioned in patch 4, which is different.


Then, the problem that MD_RECOVERY_RUNNING can be cleared can't be reproduced reliably; usually it takes 2+ days to trigger a problem, and each time
the symptom can be different. I'm hacking the kernel and adding
some BUG_ON()s to test MD_RECOVERY_RUNNING in the attached patch; the following
test can trigger the BUG_ON:

Also, your debug patch obviously added a large delay which makes the calltrace happen; I doubt a user can hit it in real life. Anyway, I will try the below test from my side.

mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
sleep 5
echo 1 > /sys/module/md_mod/parameters/set_delay
echo idle > /sys/block/md0/md/sync_action &
sleep 5
echo "want_replacement" > /sys/block/md0/md/dev-sdd/state

Combining your debug patch with the above steps, it seems the sequence is:

1. a delay is added to action_store, so it can't get the lock in time.
2. echo "want_replacement" triggers md_check_recovery, which can grab the lock
    to start the sync thread.
3. action_store finally holds the lock and clears RECOVERY_RUNNING in reap sync thread.
4. Then the newly added BUG_ON is invoked, since RECOVERY_RUNNING was cleared
    in step 3.


test result:

[  228.390237] md_check_recovery: running is set
[  228.391376] md_check_recovery: queue new sync thread
[  233.671041] action_store unregister success! delay 10s
[  233.689276] md_check_recovery: running is set
[  238.722448] md_check_recovery: running is set
[  238.723328] md_check_recovery: queue new sync thread
[  238.724851] md_do_sync: before new wor, sleep 10s
[  239.725818] md_do_sync: delay done
[  243.674828] action_store delay done
[  243.700102] md_reap_sync_thread: running is cleared!
[  243.748703] ------------[ cut here ]------------
[  243.749656] kernel BUG at drivers/md/md.c:9084!

After your debug patch is applied, does L9084 point to the below?

9084                                 mddev->curr_resync = MaxSector;

In my environment, it's a BUG_ON() that I added in md_do_sync:

Ok, so we are on different code bases ...

9080  skip:
9081         /* set CHANGE_PENDING here since maybe another update is needed,
9082          * so other nodes are informed. It should be harmless for normal
9083          * raid */
9084         BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
9085         set_mask_bits(&mddev->sb_flags, 0,
9086                       BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));


I don't understand how it triggers the below calltrace, and it has nothing to do with the
list corruption, right?

Yes, this is just an early BUG_ON() to detect whether MD_RECOVERY_RUNNING
is cleared while the sync_thread is still in progress.

The sync_thread can be interrupted once MD_RECOVERY_INTR is set, which means RUNNING can be cleared, so I am not sure the added BUG_ON is reasonable. Changing the BUG_ON
like this makes more sense to me:

+       BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+              !test_bit(MD_RECOVERY_INTR, &mddev->recovery));

I think there might be a racy window like you described, but it should be really small. I prefer to just add a few lines like this instead of reverting and introducing a new lock to resolve the same
issue (if it is the same issue).

@@ -4792,9 +4793,15 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                       if (mddev->sync_thread) {
                               sector_t save_rp = mddev->reshape_position;

+                              set_bit(MD_RECOVERY_DONOT, &mddev->recovery);
@@ -4805,6 +4812,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                               mddev->reshape_position = save_rp;
                               set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                               md_reap_sync_thread(mddev);
+                              clear_bit(MD_RECOVERY_DONOT, &mddev->recovery);
                       }
                       mddev_unlock(mddev);
@@ -9296,6 +9313,9 @@ void md_check_recovery(struct mddev *mddev)
       if (!md_is_rdwr(mddev) &&
           !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
               return;
+      /* action_store is in the middle of reap sync thread, let's wait */
+      if (test_bit(MD_RECOVERY_DONOT, &mddev->recovery))
+              return;
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -553,6 +553,7 @@ enum recovery_flags {
       MD_RECOVERY_ERROR,      /* sync-action interrupted because io-error */
       MD_RECOVERY_WAIT,       /* waiting for pers->start() to finish */
       MD_RESYNCING_REMOTE,    /* remote node is running resync thread */
+      MD_RECOVERY_DONOT,      /* for a nasty racy issue */
 };

TBH, I am reluctant to see the changes in this series; they can only be considered
acceptable under the following conditions:

1. the previous raid456 bug can be fixed in this way too; hopefully Marc or others
    can verify it.
2. it passes all the tests in mdadm.

Thanks,
Guoqing


