Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Hi,

On 2024/03/15 0:12, Dan Moulding wrote:
How about the following patch?

Thanks,
Kuai

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3ad5f3c7f91e..0b2e6060f2c9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)

          md_check_recovery(mddev);

-       blk_start_plug(&plug);
          handled = 0;
          spin_lock_irq(&conf->device_lock);
          while (1) {
@@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
                  int batch_size, released;
                  unsigned int offset;

+               /*
+                * md_check_recovery() can't clear sb_flags, usually because of
+                * 'reconfig_mutex' can't be grabbed, wait for mddev_unlock() to
+                * wake up raid5d().
+                */
+               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+                       goto skip;
+
                  released = release_stripe_list(conf, conf->temp_inactive_list);
                  if (released)
                          clear_bit(R5_DID_ALLOC, &conf->cache_state);
@@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
                          spin_lock_irq(&conf->device_lock);
                  }
          }
+skip:
          pr_debug("%d stripes handled\n", handled);
-
          spin_unlock_irq(&conf->device_lock);
          if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
              mutex_trylock(&conf->cache_size_mutex)) {
@@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
                  mutex_unlock(&conf->cache_size_mutex);
          }

+       blk_start_plug(&plug);
          flush_deferred_bios(conf);

          r5l_flush_stripe_to_raid(conf->log);

I can confirm that this patch also works. I'm unable to reproduce the
hang after applying this instead of the first patch provided by
Junxiao. So it looks like both approaches are successful in avoiding
the hang.


Thanks a lot for the testing! Can you also give the following patch a try?
It removes the change to blk_plug, because Dan and Song are worried
about performance degradation, so we need to verify the performance
before considering that patch. (A rough skeleton contrasting the two
approaches follows after the diff below.)

Anyway, I think the following patch can fix this problem as well.

Thanks,
Kuai

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3ad5f3c7f91e..ae8665be9940 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6728,6 +6728,9 @@ static void raid5d(struct md_thread *thread)
                int batch_size, released;
                unsigned int offset;

+               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+                       goto skip;
+
                released = release_stripe_list(conf, conf->temp_inactive_list);
                if (released)
                        clear_bit(R5_DID_ALLOC, &conf->cache_state);
@@ -6766,6 +6769,7 @@ static void raid5d(struct md_thread *thread)
                        spin_lock_irq(&conf->device_lock);
                }
        }
+skip:
        pr_debug("%d stripes handled\n", handled);

        spin_unlock_irq(&conf->device_lock);
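
To make the two patches easier to compare, here is a rough, non-compilable
skeleton of raid5d() under each approach. This is my own sketch, not part of
either patch: it only restates what the two diffs above already touch, the
rest of the function is elided behind comments, and the assumption that
blk_finish_plug() still sits at the end of raid5d() is mine (it does not
appear in either diff).

/*
 * Skeleton of raid5d() with the first patch applied: blk_start_plug() is
 * moved below the stripe-handling loop, so the plug only covers the flush
 * path at the end of raid5d().
 */
static void raid5d(struct md_thread *thread)
{
        /* ... local declarations (mddev, conf, plug, handled) elided ... */

        md_check_recovery(mddev);

        handled = 0;
        spin_lock_irq(&conf->device_lock);
        while (1) {
                /*
                 * Superblock update still pending: raid5d() cannot make
                 * progress here, so stop spinning and rely on mddev_unlock()
                 * waking the thread once 'reconfig_mutex' is released.
                 */
                if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
                        goto skip;

                /* ... release_stripe_list(), stripe batch handling, and the
                 *     break-when-idle condition elided ... */
        }
skip:
        pr_debug("%d stripes handled\n", handled);
        spin_unlock_irq(&conf->device_lock);

        /* ... R5_ALLOC_MORE handling elided ... */

        blk_start_plug(&plug);          /* moved: now only around the flush */
        flush_deferred_bios(conf);
        r5l_flush_stripe_to_raid(conf->log);
        /* ... blk_finish_plug(&plug) and the rest of raid5d() elided ... */
}

/*
 * Skeleton of raid5d() with the second patch applied: the plug placement is
 * left exactly as in mainline; only the MD_SB_CHANGE_PENDING early exit and
 * the 'skip' label are added.
 */
static void raid5d(struct md_thread *thread)
{
        /* ... */
        md_check_recovery(mddev);

        blk_start_plug(&plug);          /* unchanged from mainline */
        handled = 0;
        spin_lock_irq(&conf->device_lock);
        while (1) {
                if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
                        goto skip;      /* same early exit as above */

                /* ... stripe handling exactly as before ... */
        }
skip:
        pr_debug("%d stripes handled\n", handled);
        spin_unlock_irq(&conf->device_lock);
        /* ... flush_deferred_bios() etc., still under the plug ... */
}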


-- Dan