Re: raid5 hang on kernel v6.1 in combination with ext4lazyinit

Hi,

On 2024/05/16 17:36, Gustav Ekelund wrote:
On 5/16/24 03:10, Yu Kuai wrote:
Hi,

On 2024/05/15 19:57, Gustav Ekelund wrote:
Hi,

With raid5 syncing and ext4lazyinit running in parallel, I see a high
probability of a hang on the 6.1.55 kernel (log from the blocked tasks
below). I do not see this problem on the 5.10 kernel.

In thread [4], patch [2] is described as a regression going from 6.7 to
6.7.1, so it is unclear to me whether this is the same issue. Let me
know if I should reply on [4] instead if you think this could be the
same issue.

Cherry-picking [2] into 6.1 seems to resolve the hang, but following
your discussion in [4], you later reverted that patch in [3]. I tried
to follow the thread, but I cannot figure out which patch is suggested
for use instead of [2].

Would you advise against running with [2] on v6.1? Should it be used in
combination with [1] in that case?

No, you should try this patch:

https://lore.kernel.org/all/20240322081005.1112401-1-yukuai1@xxxxxxxxxxxxxxx/

Thanks,
Kuai


Best regards
Gustav

[1] commit d6e035aad6c0 ("md: bypass block throttle for superblock update")
[2] commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")
[3] commit 3445139e3a59 ("Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d""")
[4] https://lore.kernel.org/linux-raid/20240123005700.9302-1-dan@xxxxxxxx/
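
For reference, the blocked-task dump below is the SysRq "show blocked
state" output; assuming sysrq is enabled on the system, it can be
captured with, for example:

  # Dump all tasks in uninterruptible (D) state to the kernel log.
  # Assumes /proc/sys/kernel/sysrq permits the 'w' command.
  echo w > /proc/sysrq-trigger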

<6>[ 5487.973655][ T9272] sysrq: Show Blocked State
<6>[ 5487.974388][ T9272] task:md127_raid5     state:D stack:0 pid:2619  ppid:2      flags:0x00000008
<6>[ 5487.983896][ T9272] Call trace:
<6>[ 5487.987135][ T9272]  __switch_to+0xc0/0x100
<6>[ 5487.991406][ T9272]  __schedule+0x2a0/0x6b0
<6>[ 5487.995742][ T9272]  schedule+0x54/0xb4
<6>[ 5487.999658][ T9272]  raid5d+0x358/0x56c
<6>[ 5488.003576][ T9272]  md_thread+0xa8/0x15c
<6>[ 5488.007723][ T9272]  kthread+0x104/0x110
<6>[ 5488.011725][ T9272]  ret_from_fork+0x10/0x20
<6>[ 5488.016080][ T9272] task:md127_resync    state:D stack:0 pid:2620  ppid:2      flags:0x00000008
<6>[ 5488.025278][ T9272] Call trace:
<6>[ 5488.028491][ T9272]  __switch_to+0xc0/0x100
<6>[ 5488.032813][ T9272]  __schedule+0x2a0/0x6b0
<6>[ 5488.037075][ T9272]  schedule+0x54/0xb4
<6>[ 5488.041047][ T9272]  raid5_get_active_stripe+0x1f4/0x454
<6>[ 5488.046441][ T9272]  raid5_sync_request+0x350/0x390
<6>[ 5488.051401][ T9272]  md_do_sync+0x8ac/0xcc4
<6>[ 5488.055722][ T9272]  md_thread+0xa8/0x15c
<6>[ 5488.059812][ T9272]  kthread+0x104/0x110
<6>[ 5488.063814][ T9272]  ret_from_fork+0x10/0x20
<6>[ 5488.068225][ T9272] task:jbd2/md127-8    state:D stack:0 pid:2675  ppid:2      flags:0x00000008
<6>[ 5488.077425][ T9272] Call trace:
<6>[ 5488.080641][ T9272]  __switch_to+0xc0/0x100
<6>[ 5488.084906][ T9272]  __schedule+0x2a0/0x6b0
<6>[ 5488.089221][ T9272]  schedule+0x54/0xb4
<6>[ 5488.093135][ T9272]  md_write_start+0xfc/0x360
<6>[ 5488.097676][ T9272]  raid5_make_request+0x68/0x117c
<6>[ 5488.102695][ T9272]  md_handle_request+0x21c/0x354
<6>[ 5488.107565][ T9272]  md_submit_bio+0x74/0xb0
<6>[ 5488.111987][ T9272]  __submit_bio+0x100/0x27c
<6>[ 5488.116432][ T9272]  submit_bio_noacct_nocheck+0xdc/0x260
<6>[ 5488.121910][ T9272]  submit_bio_noacct+0x128/0x2e4
<6>[ 5488.126840][ T9272]  submit_bio+0x34/0xdc
<6>[ 5488.130935][ T9272]  submit_bh_wbc+0x120/0x170
<6>[ 5488.135521][ T9272]  submit_bh+0x14/0x20
<6>[ 5488.139527][ T9272]  jbd2_journal_commit_transaction+0xccc/0x1520 [jbd2]
<6>[ 5488.146400][ T9272]  kjournald2+0xb0/0x250 [jbd2]
<6>[ 5488.151194][ T9272]  kthread+0x104/0x110
<6>[ 5488.155198][ T9272]  ret_from_fork+0x10/0x20
<6>[ 5488.159608][ T9272] task:ext4lazyinit    state:D stack:0 pid:2677  ppid:2      flags:0x00000008
<6>[ 5488.168811][ T9272] Call trace:
<6>[ 5488.172026][ T9272]  __switch_to+0xc0/0x100
<6>[ 5488.176291][ T9272]  __schedule+0x2a0/0x6b0
<6>[ 5488.180618][ T9272]  schedule+0x54/0xb4
<6>[ 5488.184538][ T9272]  io_schedule+0x3c/0x60
<6>[ 5488.188714][ T9272]  bit_wait_io+0x18/0x70
<6>[ 5488.192947][ T9272]  __wait_on_bit+0x50/0x170
<6>[ 5488.197384][ T9272]  out_of_line_wait_on_bit+0x74/0x80
<6>[ 5488.202604][ T9272]  do_get_write_access+0x1e4/0x3c0 [jbd2]
<6>[ 5488.208326][ T9272]  jbd2_journal_get_write_access+0x80/0xc0 [jbd2]
<6>[ 5488.214683][ T9272]  __ext4_journal_get_write_access+0x80/0x1a4 [ext4]
<6>[ 5488.221392][ T9272]  ext4_init_inode_table+0x228/0x3d0 [ext4]
<6>[ 5488.227298][ T9272]  ext4_lazyinit_thread+0x410/0x5f4 [ext4]
<6>[ 5488.233066][ T9272]  kthread+0x104/0x110
<6>[ 5488.237069][ T9272]  ret_from_fork+0x10/0x20



Thanks for the patch, Kuai.

I ramped up the testing on multiple machines, and it seems I can still
hit the hang with the patch applied, so this could be a different
problem. As mentioned before, I run the 6.1.55 kernel and never saw
this on 5.10.72.

The blocked state is similar each time, with the same four tasks
hanging in the same places. Do you recognize this hang?

Okay, can you first clarify whether it is still true that, as you said,
"cherry-picking [2] into 6.1 seems to resolve the hang"?

There was another problem that matches the hung tasks; however, both
5.10 and 6.1 have this problem:

https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@xxxxxxxxxxxxx/T/#m62766c7d341eca35d6dcd446b6c289305b4f122e

BTW, using the addr2line tool to convert the offsets in the stack
traces into source code lines would make it much easier to locate the
problem. And can you check whether the mainline kernel still has this
problem?
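
For example, the kernel tree ships scripts/faddr2line, a wrapper around
addr2line that understands the func+offset/len format printed in stack
traces. A minimal sketch, assuming vmlinux matches the running kernel
and was built with debug info:

  # Resolve a stack-trace frame to a source file and line.
  ./scripts/faddr2line vmlinux raid5d+0x358/0x56c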

Thanks,
Kuai

Best regards
Gustav