Hi,
[cc linux-block]
On 2023/09/10 12:58, Jason Moss wrote:
Hi,
On Sat, Sep 9, 2023 at 7:45 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
Hi,
On 2023/09/07 14:19, Jason Moss wrote:
Hi,
On Wed, Sep 6, 2023 at 11:13 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
Hi,
On 2023/09/07 13:44, Jason Moss wrote:
Hi,
On Wed, Sep 6, 2023 at 6:38 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
Hi,
On 2023/09/06 22:05, Jason Moss wrote:
Hi Kuai,
I ended up using gdb rather than addr2line, as that output didn't give
me the global offset. Maybe there's a better way, but this seems to be
similar to what I expected.
It's ok.
(gdb) list *(reshape_request+0x416)
0x11566 is in reshape_request (drivers/md/raid5.c:6396).
6391 if ((mddev->reshape_backwards
6392 ? (safepos > writepos && readpos < writepos)
6393 : (safepos < writepos && readpos > writepos)) ||
6394 time_after(jiffies, conf->reshape_checkpoint + 10*HZ)) {
6395 /* Cannot proceed until we've updated the superblock... */
6396 wait_event(conf->wait_for_overlap,
6397 atomic_read(&conf->reshape_stripes)==0
6398 || test_bit(MD_RECOVERY_INTR,
If reshape is stuck here, that means either:
1) reshape IO is stuck somewhere and never completes; or
2) the counter reshape_stripes is broken.
Can you read the following debugfs files to verify whether IO is stuck in
the underlying disk?
/sys/kernel/debug/block/[disk]/hctx*/{sched_tags,tags,busy,dispatch}
I'll attach this below.
Furthermore, echoing frozen should break the above wait_event() because
'MD_RECOVERY_INTR' will be set; however, based on your description,
the problem still exists. Can you collect the stack and addr2line result
of the stuck thread after echoing frozen?
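For reference, a minimal sequence (assuming the array is md0; the reshape
kthread is normally named md0_reshape) would be roughly:

echo frozen > /sys/block/md0/md/sync_action
# dump the reshape thread's kernel stack afterwards, then resolve with gdb/addr2line as before
cat /proc/$(pgrep -x md0_reshape)/stack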
I echo'd frozen to /sys/block/md0/md/sync_action, however the echo
call has been sitting for about 30 minutes, maybe longer, and has not
returned. Here's the current state:
root 454 0.0 0.0 0 0 ? I< Sep05 0:00 [raid5wq]
root 455 0.0 0.0 34680 5988 ? D Sep05 0:00 (udev-worker)
Can you also show the stack of udev-worker, and of any other thread in
'D' state? I think the above "echo frozen" is probably also stuck in D
state.
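Something like the following (just a sketch, run as root) should dump the
kernel stack of every task currently in D state:

for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "== $pid =="
    cat /proc/$pid/stack
done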
As requested:
ps aux | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 455 0.0 0.0 34680 5988 ? D Sep05 0:00 (udev-worker)
root 457 0.0 0.0 0 0 ? D Sep05 0:00 [md0_reshape]
root 45507 0.0 0.0 8272 4736 pts/1 Ds+ Sep05 0:00 -bash
jason 279169 0.0 0.0 6976 2560 pts/0 S+ 23:16 0:00 grep --color=auto D
[jason@arch md]$ sudo cat /proc/455/stack
[<0>] wait_woken+0x54/0x60
[<0>] raid5_make_request+0x5fe/0x12f0 [raid456]
[<0>] md_handle_request+0x135/0x220 [md_mod]
[<0>] __submit_bio+0xb3/0x170
[<0>] submit_bio_noacct_nocheck+0x159/0x370
[<0>] block_read_full_folio+0x21c/0x340
[<0>] filemap_read_folio+0x40/0xd0
[<0>] filemap_get_pages+0x475/0x630
[<0>] filemap_read+0xd9/0x350
[<0>] blkdev_read_iter+0x6b/0x1b0
[<0>] vfs_read+0x201/0x350
[<0>] ksys_read+0x6f/0xf0
[<0>] do_syscall_64+0x60/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[jason@arch md]$ sudo cat /proc/45507/stack
[<0>] kthread_stop+0x6a/0x180
[<0>] md_unregister_thread+0x29/0x60 [md_mod]
[<0>] action_store+0x168/0x320 [md_mod]
[<0>] md_attr_store+0x86/0xf0 [md_mod]
[<0>] kernfs_fop_write_iter+0x136/0x1d0
[<0>] vfs_write+0x23e/0x420
[<0>] ksys_write+0x6f/0xf0
[<0>] do_syscall_64+0x60/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Please let me know if you'd like me to identify the lines for any of those.
That's enough.
Thanks,
Jason
root 456 99.9 0.0 0 0 ? R Sep05 1543:40 [md0_raid6]
root 457 0.0 0.0 0 0 ? D Sep05 0:00 [md0_reshape]
[jason@arch md]$ sudo cat /proc/457/stack
[<0>] md_do_sync+0xef2/0x11d0 [md_mod]
[<0>] md_thread+0xae/0x190 [md_mod]
[<0>] kthread+0xe8/0x120
[<0>] ret_from_fork+0x34/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
Reading symbols from md-mod.ko...
(gdb) list *(md_do_sync+0xef2)
0xb3a2 is in md_do_sync (drivers/md/md.c:9035).
9030 ? "interrupted" : "done");
9031 /*
9032 * this also signals 'finished resyncing' to md_stop
9033 */
9034 blk_finish_plug(&plug);
9035 wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
That's also waiting for reshape IO to be done from the common layer.
9036
9037 if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
9038 !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
9039 mddev->curr_resync >= MD_RESYNC_ACTIVE) {
The debugfs info:
[root@arch ~]# cat /sys/kernel/debug/block/sda/hctx*/{sched_tags,tags,busy,dispatch}
Only sched_tags is read; sorry, I didn't mean for you to use that exact cmd.
Perhaps you can use the following cmd instead:
find /sys/kernel/debug/block/sda/ -type f | xargs grep .
nr_tags=64
nr_reserved_tags=0
active_queues=0
bitmap_tags:
depth=64
busy=1
This means there is one IO in sda; however, I need more information to
figure out where this IO is. Please make sure not to run any other
thread that can read from or write to sda. You can use "iostat -dmx 1" and
observe for a while to confirm that there is no new IO.
Can you help with this? Confirm there is no new IO and collect the debugfs info.
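For example (sda is the member disk from your earlier output), something like:

iostat -dmx 1 sda

should show r/s and w/s staying at 0 for a while if nothing else is touching
the disk.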
As instructed, I confirmed there is no active IO to sda1 via iostat. I
then ran the provided command
[root@arch ~]# find /sys/kernel/debug/block/sda/ -type f | xargs grep .
/sys/kernel/debug/block/sda/rqos/wbt/wb_background:6
/sys/kernel/debug/block/sda/rqos/wbt/wb_normal:12
/sys/kernel/debug/block/sda/rqos/wbt/unknown_cnt:4
/sys/kernel/debug/block/sda/rqos/wbt/min_lat_nsec:75000000
/sys/kernel/debug/block/sda/rqos/wbt/inflight:0: inflight 1
/sys/kernel/debug/block/sda/rqos/wbt/inflight:1: inflight 0
/sys/kernel/debug/block/sda/rqos/wbt/inflight:2: inflight 0
/sys/kernel/debug/block/sda/rqos/wbt/id:0
/sys/kernel/debug/block/sda/rqos/wbt/enabled:1
/sys/kernel/debug/block/sda/rqos/wbt/curr_win_nsec:100000000
/sys/kernel/debug/block/sda/hctx0/type:default
/sys/kernel/debug/block/sda/hctx0/dispatch_busy:0
/sys/kernel/debug/block/sda/hctx0/active:0
/sys/kernel/debug/block/sda/hctx0/run:2583
/sys/kernel/debug/block/sda/hctx0/sched_tags_bitmap:00000000: 0000 0000 8000 0000
/sys/kernel/debug/block/sda/hctx0/sched_tags:nr_tags=64
/sys/kernel/debug/block/sda/hctx0/sched_tags:nr_reserved_tags=0
/sys/kernel/debug/block/sda/hctx0/sched_tags:active_queues=0
/sys/kernel/debug/block/sda/hctx0/sched_tags:bitmap_tags:
/sys/kernel/debug/block/sda/hctx0/sched_tags:depth=64
/sys/kernel/debug/block/sda/hctx0/sched_tags:busy=1
sched_tags:busy is 1, which indicates this IO made it to the elevator. That
means this problem is not related to raid; the IO issued to sda never returns.
/sys/kernel/debug/block/sda/hctx0/sched_tags:cleared=57
/sys/kernel/debug/block/sda/hctx0/sched_tags:bits_per_word=16
/sys/kernel/debug/block/sda/hctx0/sched_tags:map_nr=4
/sys/kernel/debug/block/sda/hctx0/sched_tags:alloc_hint={40, 20, 48, 0}
/sys/kernel/debug/block/sda/hctx0/sched_tags:wake_batch=8
/sys/kernel/debug/block/sda/hctx0/sched_tags:wake_index=0
/sys/kernel/debug/block/sda/hctx0/sched_tags:ws_active=0
/sys/kernel/debug/block/sda/hctx0/sched_tags:ws={
/sys/kernel/debug/block/sda/hctx0/sched_tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/sched_tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/sched_tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/sched_tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/sched_tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/sched_tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/sched_tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/sched_tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/sched_tags:}
/sys/kernel/debug/block/sda/hctx0/sched_tags:round_robin=1
/sys/kernel/debug/block/sda/hctx0/sched_tags:min_shallow_depth=48
/sys/kernel/debug/block/sda/hctx0/tags_bitmap:00000000: 0000 0000
/sys/kernel/debug/block/sda/hctx0/tags:nr_tags=32
/sys/kernel/debug/block/sda/hctx0/tags:nr_reserved_tags=0
/sys/kernel/debug/block/sda/hctx0/tags:active_queues=0
/sys/kernel/debug/block/sda/hctx0/tags:bitmap_tags:
/sys/kernel/debug/block/sda/hctx0/tags:depth=32
/sys/kernel/debug/block/sda/hctx0/tags:busy=0
tags:busy is 0, which indicates this IO didn't make it to the driver. So the
IO is still in the block layer, likely still in the elevator.
Which elevator are you using? You can confirm with:
cat /sys/block/sda/queue/scheduler
It's likely mq-deadline; anyway, can you switch to another elevator before
assembling the array and retry, to test whether you can still reproduce the
problem?
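For example (scheduler names depend on what your kernel provides; "none"
bypasses the elevator entirely):

cat /sys/block/sda/queue/scheduler     # current scheduler is shown in brackets
echo none > /sys/block/sda/queue/scheduler

and then assemble the array again and retry the reshape.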
Thanks,
Kuai
/sys/kernel/debug/block/sda/hctx0/tags:cleared=21
/sys/kernel/debug/block/sda/hctx0/tags:bits_per_word=8
/sys/kernel/debug/block/sda/hctx0/tags:map_nr=4
/sys/kernel/debug/block/sda/hctx0/tags:alloc_hint={19, 26, 7, 21}
/sys/kernel/debug/block/sda/hctx0/tags:wake_batch=4
/sys/kernel/debug/block/sda/hctx0/tags:wake_index=0
/sys/kernel/debug/block/sda/hctx0/tags:ws_active=0
/sys/kernel/debug/block/sda/hctx0/tags:ws={
/sys/kernel/debug/block/sda/hctx0/tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/tags: {.wait=inactive},
/sys/kernel/debug/block/sda/hctx0/tags:}
/sys/kernel/debug/block/sda/hctx0/tags:round_robin=1
/sys/kernel/debug/block/sda/hctx0/tags:min_shallow_depth=4294967295
/sys/kernel/debug/block/sda/hctx0/ctx_map:00000000: 00
/sys/kernel/debug/block/sda/hctx0/flags:alloc_policy=RR SHOULD_MERGE
/sys/kernel/debug/block/sda/sched/queued:0 0 0
/sys/kernel/debug/block/sda/sched/owned_by_driver:0 0 0
/sys/kernel/debug/block/sda/sched/async_depth:48
/sys/kernel/debug/block/sda/sched/starved:0
/sys/kernel/debug/block/sda/sched/batching:2
/sys/kernel/debug/block/sda/state:SAME_COMP|IO_STAT|ADD_RANDOM|INIT_DONE|WC|STATS|REGISTERED|NOWAIT|SQ_SCHED
/sys/kernel/debug/block/sda/pm_only:0
Let me know if there's anything further I can provide to assist in
troubleshooting.