Re: raid5 hang on kernel v6.1 in combination with ext4lazyinit

On 5/16/24 03:10, Yu Kuai wrote:
> Hi,
> 
> 在 2024/05/15 19:57, Gustav Ekelund 写道:
>> Hi,
>>
>> With raid5 syncing and ext4lazyinit running in parallel, I have a high
>> probability of hanging on the 6.1.55 kernel (Log from blocked tasks
>> below). I do not see this problem on the 5.10 kernel.
>>
>> Thread [4] describes a regression going from 6.7 to 6.7.1 involving
>> patch [2], so it is unclear to me whether this is the same issue. Let
>> me know if I should reply on [4] instead if you think it could be the
>> same issue.
>>
>> Cherry-picking [2] into 6.1 seems to resolve the hang, but following
>> the discussion in [4] you later reverted that patch in [3]. I tried to
>> follow the thread, but I cannot figure out which patch is suggested to
>> be used instead of [2].
>>
>> Would you advise against running with [2] on v6.1? Should it be
>> combined with [1] in that case?
> 
> No, you should try this patch:
> 
> https://lore.kernel.org/all/20240322081005.1112401-1-yukuai1@xxxxxxxxxxxxxxx/
> 
> Thanks,
> Kuai
> 
>>
>> Best regards
>> Gustav
>>
>> [1] commit d6e035aad6c0 ("md: bypass block throttle for superblock
>> update")
>> [2] commit bed9e27baf52 ("Revert "md/raid5: Wait for
>> MD_SB_CHANGE_PENDING in raid5d"")
>> [3] commit 3445139e3a59 ("Revert "Revert "md/raid5: Wait for
>> MD_SB_CHANGE_PENDING in raid5d""")
>> [4]
>> https://lore.kernel.org/linux-raid/20240123005700.9302-1-dan@xxxxxxxx/
>>
>> <6>[ 5487.973655][ T9272] sysrq: Show Blocked State
>> <6>[ 5487.974388][ T9272] task:md127_raid5     state:D stack:0
>> pid:2619  ppid:2      flags:0x00000008
>> <6>[ 5487.983896][ T9272] Call trace:
>> <6>[ 5487.987135][ T9272]  __switch_to+0xc0/0x100
>> <6>[ 5487.991406][ T9272]  __schedule+0x2a0/0x6b0
>> <6>[ 5487.995742][ T9272]  schedule+0x54/0xb4
>> <6>[ 5487.999658][ T9272]  raid5d+0x358/0x56c
>> <6>[ 5488.003576][ T9272]  md_thread+0xa8/0x15c
>> <6>[ 5488.007723][ T9272]  kthread+0x104/0x110
>> <6>[ 5488.011725][ T9272]  ret_from_fork+0x10/0x20
>> <6>[ 5488.016080][ T9272] task:md127_resync    state:D stack:0
>> pid:2620  ppid:2      flags:0x00000008
>> <6>[ 5488.025278][ T9272] Call trace:
>> <6>[ 5488.028491][ T9272]  __switch_to+0xc0/0x100
>> <6>[ 5488.032813][ T9272]  __schedule+0x2a0/0x6b0
>> <6>[ 5488.037075][ T9272]  schedule+0x54/0xb4
>> <6>[ 5488.041047][ T9272]  raid5_get_active_stripe+0x1f4/0x454
>> <6>[ 5488.046441][ T9272]  raid5_sync_request+0x350/0x390
>> <6>[ 5488.051401][ T9272]  md_do_sync+0x8ac/0xcc4
>> <6>[ 5488.055722][ T9272]  md_thread+0xa8/0x15c
>> <6>[ 5488.059812][ T9272]  kthread+0x104/0x110
>> <6>[ 5488.063814][ T9272]  ret_from_fork+0x10/0x20
>> <6>[ 5488.068225][ T9272] task:jbd2/md127-8    state:D stack:0
>> pid:2675  ppid:2      flags:0x00000008
>> <6>[ 5488.077425][ T9272] Call trace:
>> <6>[ 5488.080641][ T9272]  __switch_to+0xc0/0x100
>> <6>[ 5488.084906][ T9272]  __schedule+0x2a0/0x6b0
>> <6>[ 5488.089221][ T9272]  schedule+0x54/0xb4
>> <6>[ 5488.093135][ T9272]  md_write_start+0xfc/0x360
>> <6>[ 5488.097676][ T9272]  raid5_make_request+0x68/0x117c
>> <6>[ 5488.102695][ T9272]  md_handle_request+0x21c/0x354
>> <6>[ 5488.107565][ T9272]  md_submit_bio+0x74/0xb0
>> <6>[ 5488.111987][ T9272]  __submit_bio+0x100/0x27c
>> <6>[ 5488.116432][ T9272]  submit_bio_noacct_nocheck+0xdc/0x260
>> <6>[ 5488.121910][ T9272]  submit_bio_noacct+0x128/0x2e4
>> <6>[ 5488.126840][ T9272]  submit_bio+0x34/0xdc
>> <6>[ 5488.130935][ T9272]  submit_bh_wbc+0x120/0x170
>> <6>[ 5488.135521][ T9272]  submit_bh+0x14/0x20
>> <6>[ 5488.139527][ T9272]  jbd2_journal_commit_transaction+0xccc/0x1520
>> [jbd2]
>> <6>[ 5488.146400][ T9272]  kjournald2+0xb0/0x250 [jbd2]
>> <6>[ 5488.151194][ T9272]  kthread+0x104/0x110
>> <6>[ 5488.155198][ T9272]  ret_from_fork+0x10/0x20
>> <6>[ 5488.159608][ T9272] task:ext4lazyinit    state:D stack:0
>> pid:2677  ppid:2      flags:0x00000008
>> <6>[ 5488.168811][ T9272] Call trace:
>> <6>[ 5488.172026][ T9272]  __switch_to+0xc0/0x100
>> <6>[ 5488.176291][ T9272]  __schedule+0x2a0/0x6b0
>> <6>[ 5488.180618][ T9272]  schedule+0x54/0xb4
>> <6>[ 5488.184538][ T9272]  io_schedule+0x3c/0x60
>> <6>[ 5488.188714][ T9272]  bit_wait_io+0x18/0x70
>> <6>[ 5488.192947][ T9272]  __wait_on_bit+0x50/0x170
>> <6>[ 5488.197384][ T9272]  out_of_line_wait_on_bit+0x74/0x80
>> <6>[ 5488.202604][ T9272]  do_get_write_access+0x1e4/0x3c0 [jbd2]
>> <6>[ 5488.208326][ T9272]  jbd2_journal_get_write_access+0x80/0xc0 [jbd2]
>> <6>[ 5488.214683][ T9272]  __ext4_journal_get_write_access+0x80/0x1a4
>> [ext4]
>> <6>[ 5488.221392][ T9272]  ext4_init_inode_table+0x228/0x3d0 [ext4]
>> <6>[ 5488.227298][ T9272]  ext4_lazyinit_thread+0x410/0x5f4 [ext4]
>> <6>[ 5488.233066][ T9272]  kthread+0x104/0x110
>> <6>[ 5488.237069][ T9272]  ret_from_fork+0x10/0x20
>>
>> .
>>
> 
Thanks for the patch, Kuai.

I ramped up testing on multiple machines, and it seems I can still hit
the hang with the patch applied, so this could be a different problem. As
mentioned before, I am running the 6.1.55 kernel and have never seen this
on 5.10.72.

The blocked state looks the same on every occurrence, with the same four
tasks hanging in the same places. Do you recognize this hang? A rough
sketch of the scenario is below.
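
For reference, a minimal sketch of a setup that matches the scenario
above (raid5 initial resync running in parallel with ext4lazyinit);
device names, array layout and mkfs options are placeholders, not my
exact configuration:

  # Placeholder devices/array name; a freshly created raid5 array starts
  # its initial resync (md127_resync) immediately.
  mdadm --create /dev/md127 --level=5 --raid-devices=3 \
        /dev/sdb /dev/sdc /dev/sdd
  # lazy_itable_init=1 defers inode table zeroing to the ext4lazyinit
  # kernel thread instead of doing it during mkfs.
  mkfs.ext4 -E lazy_itable_init=1,lazy_journal_init=1 /dev/md127
  mount /dev/md127 /mnt
  # ext4lazyinit then writes in the background while the resync is still
  # running; once the array stops making progress, dump blocked (state D)
  # tasks, which is what produced the log above.
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 80
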

Best regards
Gustav



