On 5/16/24 03:10, Yu Kuai wrote:
> Hi,
>
> On 2024/05/15 19:57, Gustav Ekelund wrote:
>> Hi,
>>
>> With raid5 syncing and ext4lazyinit running in parallel, I have a high
>> probability of hanging on the 6.1.55 kernel (log from blocked tasks
>> below). I do not see this problem on the 5.10 kernel.
>>
>> In thread [4], patch [2] is described as causing a regression when going
>> from 6.7 to 6.7.1, so it is unclear to me whether this is the same issue.
>> Let me know if I should reply on [4] instead, if you think it could be
>> the same issue.
>>
>> Cherry-picking [2] into 6.1 seems to resolve the hang, but following
>> your discussion in [4] you later reverted that patch in [3]. I tried to
>> follow the thread, but I cannot figure out which patch is suggested to
>> be used instead of [2].
>>
>> Would you advise against running with [2] on v6.1? Should it be used in
>> combination with [1] in that case?
>
> No, you should try this patch:
>
> https://lore.kernel.org/all/20240322081005.1112401-1-yukuai1@xxxxxxxxxxxxxxx/
>
> Thanks,
> Kuai
>
>>
>> Best regards
>> Gustav
>>
>> [1] commit d6e035aad6c0 ("md: bypass block throttle for superblock update")
>> [2] commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")
>> [3] commit 3445139e3a59 ("Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d""")
>> [4] https://lore.kernel.org/linux-raid/20240123005700.9302-1-dan@xxxxxxxx/
>>
>> <6>[ 5487.973655][ T9272] sysrq: Show Blocked State
>> <6>[ 5487.974388][ T9272] task:md127_raid5 state:D stack:0 pid:2619 ppid:2 flags:0x00000008
>> <6>[ 5487.983896][ T9272] Call trace:
>> <6>[ 5487.987135][ T9272]  __switch_to+0xc0/0x100
>> <6>[ 5487.991406][ T9272]  __schedule+0x2a0/0x6b0
>> <6>[ 5487.995742][ T9272]  schedule+0x54/0xb4
>> <6>[ 5487.999658][ T9272]  raid5d+0x358/0x56c
>> <6>[ 5488.003576][ T9272]  md_thread+0xa8/0x15c
>> <6>[ 5488.007723][ T9272]  kthread+0x104/0x110
>> <6>[ 5488.011725][ T9272]  ret_from_fork+0x10/0x20
>> <6>[ 5488.016080][ T9272] task:md127_resync state:D stack:0 pid:2620 ppid:2 flags:0x00000008
>> <6>[ 5488.025278][ T9272] Call trace:
>> <6>[ 5488.028491][ T9272]  __switch_to+0xc0/0x100
>> <6>[ 5488.032813][ T9272]  __schedule+0x2a0/0x6b0
>> <6>[ 5488.037075][ T9272]  schedule+0x54/0xb4
>> <6>[ 5488.041047][ T9272]  raid5_get_active_stripe+0x1f4/0x454
>> <6>[ 5488.046441][ T9272]  raid5_sync_request+0x350/0x390
>> <6>[ 5488.051401][ T9272]  md_do_sync+0x8ac/0xcc4
>> <6>[ 5488.055722][ T9272]  md_thread+0xa8/0x15c
>> <6>[ 5488.059812][ T9272]  kthread+0x104/0x110
>> <6>[ 5488.063814][ T9272]  ret_from_fork+0x10/0x20
>> <6>[ 5488.068225][ T9272] task:jbd2/md127-8 state:D stack:0 pid:2675 ppid:2 flags:0x00000008
>> <6>[ 5488.077425][ T9272] Call trace:
>> <6>[ 5488.080641][ T9272]  __switch_to+0xc0/0x100
>> <6>[ 5488.084906][ T9272]  __schedule+0x2a0/0x6b0
>> <6>[ 5488.089221][ T9272]  schedule+0x54/0xb4
>> <6>[ 5488.093135][ T9272]  md_write_start+0xfc/0x360
>> <6>[ 5488.097676][ T9272]  raid5_make_request+0x68/0x117c
>> <6>[ 5488.102695][ T9272]  md_handle_request+0x21c/0x354
>> <6>[ 5488.107565][ T9272]  md_submit_bio+0x74/0xb0
>> <6>[ 5488.111987][ T9272]  __submit_bio+0x100/0x27c
>> <6>[ 5488.116432][ T9272]  submit_bio_noacct_nocheck+0xdc/0x260
>> <6>[ 5488.121910][ T9272]  submit_bio_noacct+0x128/0x2e4
>> <6>[ 5488.126840][ T9272]  submit_bio+0x34/0xdc
>> <6>[ 5488.130935][ T9272]  submit_bh_wbc+0x120/0x170
>> <6>[ 5488.135521][ T9272]  submit_bh+0x14/0x20
>> <6>[ 5488.139527][ T9272]  jbd2_journal_commit_transaction+0xccc/0x1520 [jbd2]
>> <6>[ 5488.146400][ T9272]  kjournald2+0xb0/0x250 [jbd2]
>> <6>[ 5488.151194][ T9272]  kthread+0x104/0x110
>> <6>[ 5488.155198][ T9272]  ret_from_fork+0x10/0x20
>> <6>[ 5488.159608][ T9272] task:ext4lazyinit state:D stack:0 pid:2677 ppid:2 flags:0x00000008
>> <6>[ 5488.168811][ T9272] Call trace:
>> <6>[ 5488.172026][ T9272]  __switch_to+0xc0/0x100
>> <6>[ 5488.176291][ T9272]  __schedule+0x2a0/0x6b0
>> <6>[ 5488.180618][ T9272]  schedule+0x54/0xb4
>> <6>[ 5488.184538][ T9272]  io_schedule+0x3c/0x60
>> <6>[ 5488.188714][ T9272]  bit_wait_io+0x18/0x70
>> <6>[ 5488.192947][ T9272]  __wait_on_bit+0x50/0x170
>> <6>[ 5488.197384][ T9272]  out_of_line_wait_on_bit+0x74/0x80
>> <6>[ 5488.202604][ T9272]  do_get_write_access+0x1e4/0x3c0 [jbd2]
>> <6>[ 5488.208326][ T9272]  jbd2_journal_get_write_access+0x80/0xc0 [jbd2]
>> <6>[ 5488.214683][ T9272]  __ext4_journal_get_write_access+0x80/0x1a4 [ext4]
>> <6>[ 5488.221392][ T9272]  ext4_init_inode_table+0x228/0x3d0 [ext4]
>> <6>[ 5488.227298][ T9272]  ext4_lazyinit_thread+0x410/0x5f4 [ext4]
>> <6>[ 5488.233066][ T9272]  kthread+0x104/0x110
>> <6>[ 5488.237069][ T9272]  ret_from_fork+0x10/0x20
>>
>> .
>>

Thanks for the patch, Kuai. I ramped up testing on multiple machines, and it
seems I can still hit the hang with the patch applied, so this could be a
different problem. As mentioned before, I am running the 6.1.55 kernel and
never saw this on 5.10.72. The blocked state is similar each time, with the
same four tasks hanging in the same places. Do you recognize this hang?
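For reference, the kind of setup that hits this here looks roughly like the
below (device names and the exact array layout are placeholders, not my real
configuration): creating the raid5 array starts the initial resync, and
mounting a freshly made ext4 filesystem (lazy_itable_init is typically on by
default in mkfs.ext4) starts the ext4lazyinit thread while the resync is
still running:

  mdadm --create /dev/md127 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc
  mkfs.ext4 /dev/md127     # inode table zeroing is deferred to the kernel thread
  mount /dev/md127 /mnt    # ext4lazyinit now runs in parallel with md127_resync

Best regards

Gustav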