Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> · Fri, 15 Mar 2024 09:30:11 +0800

Hi,

在 2024/03/15 2:20, junxiao.bi@xxxxxxxxxx 写道:
On 3/12/24 6:20 PM, Yu Kuai wrote:

Hi,

在 2024/03/13 6:56, junxiao.bi@xxxxxxxxxx 写道:
On 3/10/24 6:50 PM, Yu Kuai wrote:

Hi,

在 2024/03/09 7:49, junxiao.bi@xxxxxxxxxx 写道:
Here is the root cause for this issue:

Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in 
raid5d") introduced a regression, it got reverted through commit 
bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in 
raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing, 
commit d6e035aad6c0 ("md: bypass block throttle for superblock 
update") was created, it avoids md superblock write getting 
throttled by block layer which is good, but md superblock write 
could be stuck in block layer due to block flush as well, and that 
is what was happening in this regression report.

Process "md0_reclaim" got stuck while waiting IO for md superblock 
write done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags, 
these 3 steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed 
before done, the hung of this process is because the last step 
"POSTFLUSH" never done. And that was because of  process 
"md0_raid5" submitted another IO with REQ_FUA flag marked just 
before that step started. To handle that IO, blk_insert_flush() 
will be invoked and hit "REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH" case 
where "fq->flush_data_in_flight" will be increased. When the IO for 
md superblock write was to issue "POSTFLUSH" step through 
blk_kick_flush(), it found that "fq->flush_data_in_flight" was not 
zero, so it will skip that step, that is expected, because flush 
will be triggered when "fq->flush_data_in_flight" dropped to zero.

Unfortunately here that inflight data IO from "md0_raid5" will 
never done, because it was added into the blk_plug list of that 
process, but "md0_raid5" run into infinite loop due to 
"MD_SB_CHANGE_PENDING" which made it never had a chance to finish 
the blk plug until "MD_SB_CHANGE_PENDING" was cleared. Process 
"md0_reclaim" was supposed to clear that flag but it was stuck by 
"md0_raid5", so this is a deadlock.

Looks like the approach in the RFC patch trying to resolve the 
regression of commit 5e2cf333b7bd can help this issue. Once 
"md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should 
release all its staging IO requests to avoid blocking others. Also 
a cond_reschedule() will avoid it run into lockup.

The analysis sounds good, however, it seems to me that the behaviour
raid5d() pings the cpu to wait for 'MD_SB_CHANGE_PENDING' to be cleared
is not reasonable, because md_check_recovery() must hold
'reconfig_mutex' to clear the flag.

That's the behavior before commit 5e2cf333b7bd which was added into 
Sep 2022, so this behavior has been with raid5 for many years.


Yes, it exists for a long time doesn't mean it's good. It is really
weird to hold spinlock to wait for a mutex.
I am confused about this, where is the code that waiting mutex while 
holding spinlock, wouldn't that cause a deadlock?

For example, assume that other contex already holding the
'reconfig_mutex', and this can be slow, then in raid5d:

md_check_recovery
 try lock 'reconfig_mutex' failed

while (1)
 hold spin_lock
 try to issue IO, failed
 release spin_lock
 blk_flush_plug
 hold spin_lock

So, untill other contex release the 'reconfig_mutex', and then
md_check_recovery() update super_block, raid5d() will not make progress, 
meanwhile it will waste one cpu.

Thanks,
Kuai



Look at raid1/raid10, there are two different behaviour that seems can
avoid this problem as well:

1) blk_start_plug() is delayed until all failed IO is handled. This 
look
reasonable because in order to get better performance, IO should be
handled by submitted thread as much as possible, and meanwhile, the
deadlock can be triggered here.
2) if 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(), 
skip
the handling of failed IO, and when mddev_unlock() is called, daemon
thread will be woken up again to handle failed IO.

How about the following patch?

Thanks,
Kuai

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3ad5f3c7f91e..0b2e6060f2c9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)

        md_check_recovery(mddev);

-       blk_start_plug(&plug);
        handled = 0;
        spin_lock_irq(&conf->device_lock);
        while (1) {
@@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
                int batch_size, released;
                unsigned int offset;

+               /*
+                * md_check_recovery() can't clear sb_flags, usually 
because of
+                * 'reconfig_mutex' can't be grabbed, wait for 
mddev_unlock() to
+                * wake up raid5d().
+                */
+               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+                       goto skip;
+
                released = release_stripe_list(conf, 
conf->temp_inactive_list);
                if (released)
                        clear_bit(R5_DID_ALLOC, &conf->cache_state);
@@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
spin_lock_irq(&conf->device_lock);
                }
        }
+skip:
        pr_debug("%d stripes handled\n", handled);
-
        spin_unlock_irq(&conf->device_lock);
        if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
            mutex_trylock(&conf->cache_size_mutex)) {
@@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
                mutex_unlock(&conf->cache_size_mutex);
        }

+       blk_start_plug(&plug);
        flush_deferred_bios(conf);

        r5l_flush_stripe_to_raid(conf->log);

This patch eliminated the benefit of blk_plug, i think it will not be 
good for IO performance perspective?

There is only one daemon thread, so IO should not be handled here as
much as possible. The IO should be handled by the thread that is
submitting the IO, and let daemon to hanldle the case that IO failed or
can't be submitted at that time.

I am not sure how much it will impact regarding drop the blk_plug.

Song, what's your take on this?

Thanks,

Junxiao.


Thanks,
Kuai



Thanks,

Junxiao.



https://www.spinics.net/lists/raid/msg75338.html

Dan, can you try the following patch?

diff --git a/block/blk-core.c b/block/blk-core.c
index de771093b526..474462abfbdc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, 
bool from_schedule)
         if (unlikely(!rq_list_empty(plug->cached_rq)))
                 blk_mq_free_plug_rqs(plug);
  }
+EXPORT_SYMBOL(__blk_flush_plug);

  /**
   * blk_finish_plug - mark the end of a batch of submitted I/O
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8497880135ee..26e09cdf46a3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
spin_unlock_irq(&conf->device_lock);
                         md_check_recovery(mddev);
spin_lock_irq(&conf->device_lock);
+               } else {
+ spin_unlock_irq(&conf->device_lock);
+                       blk_flush_plug(&plug, false);
+                       cond_resched();
+ spin_lock_irq(&conf->device_lock);
                 }
         }
         pr_debug("%d stripes handled\n", handled);

Thanks,

Junxiao.

On 3/1/24 12:26 PM, junxiao.bi@xxxxxxxxxx wrote:
Hi Dan & Song,

I have not root cause this yet, but would like share some findings 
from the vmcore Dan shared. From what i can see, this doesn't look 
like a md issue, but something wrong with block layer or below.

1. There were multiple process hung by IO over 15mins.

crash> ps -m | grep UN
[0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0 CPU: 1 
COMMAND: "jbd2/dm-3-8"
[0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0 CPU: 2 
COMMAND: "dd"
[0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00 CPU: 3 
COMMAND: "md0_reclaim"
[0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740 CPU: 1 
COMMAND: "kworker/1:2"
[0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80 CPU: 7 
COMMAND: "kworker/u21:1"

2. Let pick md0_reclaim to take a look, it is waiting done 
super_block update. We can see there were two pending superblock 
write and other pending io for the underling physical disk, which 
caused these process hung.

crash> bt 876
PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: 
"md0_reclaim"
 #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
 #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
 #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
 #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
 #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
 #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
 #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
 #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
 #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1

crash> mddev.pending_writes,disks 0xffff888108335800
  pending_writes = {
    counter = 2  <<<<<<<<<< 2 active super block write
  },
  disks = {
    next = 0xffff88810ce85a00,
    prev = 0xffff88810ce84c00
  },
crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 
0xffff88810ce85a00
ffff88810ce85a00
  kobj.name = 0xffff8881067c1a00 "dev-dm-1",
  nr_pending = {
    counter = 0
  },
ffff8881083ace00
  kobj.name = 0xffff888100a93280 "dev-sde",
  nr_pending = {
    counter = 10 <<<<
  },
ffff8881010ad200
  kobj.name = 0xffff8881012721c8 "dev-sdc",
  nr_pending = {
    counter = 8 <<<<<
  },
ffff88810ce84c00
  kobj.name = 0xffff888100325f08 "dev-sdd",
  nr_pending = {
    counter = 2 <<<<<
  },

3. From block layer, i can find the inflight IO for md superblock 
write which has been pending 955s which matches with the hung time 
of "md0_reclaim"

crash> 
request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io 
ffff888103b4c300
  q = 0xffff888103a00d80,
  mq_hctx = 0xffff888103c5d200,
  cmd_flags = 38913,
  rq_flags = 139408,
  start_time_ns = 1504179024146,
  bio = 0x0,
  biotail = 0xffff888120758e40,
  state = MQ_RQ_COMPLETE,
  __data_len = 0,
  flush = {
    seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA
    saved_end_io = 0x0
  },
  end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,

crash> p tk_core.timekeeper.tkr_mono.base
$1 = 2459916243002
crash> eval 2459916243002-1504179024146
hexadecimal: de86609f28
    decimal: 955737218856  <<<<<<< IO pending time is 955s
      octal: 15720630117450
     binary: 
0000000000000000000000001101111010000110011000001001111100101000

crash> bio.bi_iter,bi_end_io 0xffff888120758e40
  bi_iter = {
    bi_sector = 8, <<<< super block offset
    bi_size = 0,
    bi_idx = 0,
    bi_bvec_done = 0
  },
  bi_end_io = 0xffffffff817dca50 <super_written>,
crash> dev -d | grep ffff888103a00d80
    8 ffff8881003ab000   sdd        ffff888103a00d80 0 0 0

4. Check above request, even its state is "MQ_RQ_COMPLETE", but it 
is still pending. That's because each md superblock write was 
marked with REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 
steps: pre_flush, data, and post_flush. Once each step complete, 
it will be marked in "request.flush.seq", here the value is 3, 
which is REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA, so the last step 
"post_flush" has not be done. Another wired thing is that 
blk_flush_queue.flush_data_in_flight is still 1 even "data" step 
already done.

crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
  fq = 0xffff88810332e240,
crash> blk_flush_queue 0xffff88810332e240
struct blk_flush_queue {
  mq_flush_lock = {
    {
      rlock = {
        raw_lock = {
          {
            val = {
              counter = 0
            },
            {
              locked = 0 '\000',
              pending = 0 '\000'
            },
            {
              locked_pending = 0,
              tail = 0
            }
          }
        }
      }
    }
  },
  flush_pending_idx = 1,
  flush_running_idx = 1,
  rq_status = 0 '\000',
  flush_pending_since = 4296171408,
  flush_queue = {{
      next = 0xffff88810332e250,
      prev = 0xffff88810332e250
    }, {
      next = 0xffff888103b4c348, <<<< the request is in this list
      prev = 0xffff888103b4c348
    }},
  flush_data_in_flight = 1,  >>>>>> still 1
  flush_rq = 0xffff888103c2e000
}

crash> list 0xffff888103b4c348
ffff888103b4c348
ffff88810332e260

crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of 
hw queue
  tag = -1,
  state = MQ_RQ_IDLE,
  ref = {
    counter = 0
  },

5. Looks like the block layer or underlying(scsi/virtio-scsi) may 
have some issue which leading to the io request from md layer 
stayed in a partial complete statue. I can't see how this can be 
related with the commit bed9e27baf52 ("Revert "md/raid5: Wait for 
MD_SB_CHANGE_PENDING in raid5d"")


Dan,

Are you able to reproduce using some regular scsi disk, would like 
to rule out whether this is related with virtio-scsi?

And I see the kernel version is 6.8.0-rc5 from vmcore, is this the 
official mainline v6.8-rc5 without any other patches?


Thanks,

Junxiao.

On 2/23/24 6:13 PM, Song Liu wrote:
Hi,

On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:
On 21.02.24 00:06, Dan Moulding wrote:
Just a friendly reminder that this regression still exists on the
mainline. It has been reverted in 6.7 stable. But I upgraded a
development system to 6.8-rc5 today and immediately hit this issue
again. Then I saw that it hasn't yet been reverted in Linus' tree.
Song Liu, what's the status here? I aware that you fixed with 
quite a
few regressions recently, but it seems like resolving this one is
stalled. Or were you able to reproduce the issue or make some 
progress
and I just missed it?
Sorry for the delay with this issue. I have been occupied with some
other stuff this week.

I haven't got luck to reproduce this issue. I will spend more 
time looking
into it next week.

And if not, what's the way forward here wrt to the release of 6.8?
Revert the culprit and try again later? Or is that not an option 
for one
reason or another?
If we don't make progress with it in the next week, we will do 
the revert,
same as we did with stable kernels.

Or do we assume that this is not a real issue? That it's caused 
by some
oddity (bit-flip in the metadata or something like that?) only 
to be
found in Dan's setup?
I don't think this is because of oddities. Hopefully we can get more
information about this soon.

Thanks,
Song

Ciao, Thorsten (wearing his 'the Linux kernel's regression 
tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that 
page.

#regzbot poke


.


.


.