On Tue, Oct 22, 2024 at 05:59:13PM +0200, Niklas Cassel wrote:
> On Tue, Oct 22, 2024 at 01:04:10PM +0200, Niklas Cassel wrote:
> > On Tue, Oct 22, 2024 at 05:55:17PM +0800, Lai, Yi wrote:
> > > On Tue, Oct 22, 2024 at 10:48:04AM +0200, Niklas Cassel wrote:
> > > > On Tue, Oct 22, 2024 at 01:44:08PM +0800, Lai, Yi wrote:
> > > > > On Mon, Oct 21, 2024 at 05:20:12PM +0200, Niklas Cassel wrote:
> > > > > > On Mon, Oct 21, 2024 at 02:07:21PM +0200, Niklas Cassel wrote:
> > > > > > > Hello Yi Lai,
> > > > > > >
> > > > > > > On Mon, Oct 21, 2024 at 06:58:59PM +0800, Lai, Yi wrote:
> > > > > > > > Hi Niklas Cassel,
> > > > > > > >
> > > > > > > > Greetings!
> > > > > > > >
> > > > > > > > I used Syzkaller and found that there is INFO: task hung in blk_mq_get_tag in v6.12-rc3
> > > > > > > >
> > > > > > > > After bisection, the first bad commit is:
> > > > > > > > "
> > > > > > > > e5dd410acb34 ata: libata: Clear DID_TIME_OUT for ATA PT commands with sense data
> > > > > > > > "
> > > > > > >
> > > > > > > It might be that your bisection results are accurate.
> > > > > > >
> > > > > > > However, after looking at the stacktraces, I find it way more likely that
> > > > > > > bisection has landed on the wrong commit.
> > > > > > >
> > > > > > > See this series that was just queued (for 6.13) a few days ago that solves a
> > > > > > > similar starvation:
> > > > > > > https://lore.kernel.org/linux-block/20241014092934.53630-1-songmuchun@xxxxxxxxxxxxx/
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-6.13/block
> > > > > > >
> > > > > > > You could perhaps run with v6.12-rc4 (which should be able to trigger the bug)
> > > > > > > and then try v6.12-rc4 + that series applied, to see if you can still trigger
> > > > > > > the bug?
> > >
> > > I tried kernel linux-block
> > > https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
> > > branch for-6.13/block commit c97f91b1807a7966077b69b24f28c2dbcde664e9.
> > > Issue can still be reproduced.
> >
> > Thanks again for your testing!
> >
> > I will try to reproduce using your reproducer (repro.c),
> > and see if I can make any sense of this.
>
> Hello Yi Lai,
>
> An important thing that I noticed is that even when I look at your good commit
> (v5.11) - the first commit you marked as good in git bisect - it has this:
> https://github.com/laifryiee/syzkaller_logs/blob/main/241018_105830_blk_mq_get_tag/f40ddce88593482919761f74910f42f4b84c004b_dmesg.log
>
> [ 300.525626] __schedule+0xb9f/0x2eb0
> [ 300.525843] ? __pfx___schedule+0x10/0x10
> [ 300.526056] ? lock_release+0x441/0x870
> [ 300.526286] ? __pfx_lock_release+0x10/0x10
> [ 300.526512] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 300.526801] ? kthread_data+0x61/0xd0
> [ 300.527007] schedule+0xf6/0x3f0
> [ 300.527225] io_schedule+0xce/0x150
> [ 300.527408] rq_qos_wait+0x1c5/0x310
> [ 300.527610] ? __pfx_wbt_cleanup_cb+0x10/0x10
> [ 300.527847] ? __pfx_rq_qos_wait+0x10/0x10
> [ 300.528070] ? __pfx_lock_release+0x10/0x10
> [ 300.528296] ? __pfx_rq_qos_wake_function+0x10/0x10
> [ 300.528562] ? __pfx_wbt_inflight_cb+0x10/0x10
> [ 300.528806] ? do_raw_spin_unlock+0x15c/0x210
> [ 300.529048] wbt_wait+0x1ec/0x400
>
> In fact, most *_dmesg.log logs in your syzlog directory have a stack trace that
> contains a "blocked for more than 147 seconds" error.
> Many of these logs (with the "blocked for more than 147 seconds" error) are
> from commits that your bisect script has classified as "good".
>
>
> If we look at your own dmesg with Linux 6.12-rc3 + e5dd410acb34 reverted,
> which passes your git bisect script, even if it has a
> "blocked for more than 147 seconds" error:
> https://github.com/laifryiee/syzkaller_logs/blob/main/241018_105830_blk_mq_get_tag/8e929cb546ee42c9a61d24fae60605e9e3192354_e5dd410acb34c7341a0a93b429dcf3dabf9e3323_revert_dmesg.log
> It seems that the reason is that you don't find "blk_mq_get_tag" in dmesg:
> |1019_054406|/var/www/html/bzimage/bzImage_8e929cb546ee42c9a61d24fae60605e9e3192354_e5dd410acb34c7341a0a93b429dcf3dabf9e3323_revert didn't contain blk_mq_get_tag: in dmesg, pass|
> |1019_054406|Bisect successfully! 8e929cb546ee42c9a61d24fae60605e9e3192354_e5dd410acb34c7341a0a93b429dcf3dabf9e3323_revert bzimage passed!|
>
> So it seems that both v5.11 and "v6.12-rc3 + e5dd410acb34 reverted" have
> problems.
>
>
> I managed to reproduce using your reproducer, and with
> v6.12-rc4 + e5dd410acb34 reverted, I see:
> [ 299.641187] __schedule+0x1144/0x3200
> [ 299.641425] ? __pfx___schedule+0x10/0x10
> [ 299.641651] ? lock_release+0x4b0/0x870
> [ 299.641868] ? debug_smp_processor_id+0x1b/0x30
> [ 299.642128] ? __pfx_lock_release+0x10/0x10
> [ 299.642362] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 299.642640] ? lock_acquire+0x80/0xb0
> [ 299.642852] ? schedule+0x202/0x3d0
> [ 299.643068] schedule+0xe7/0x3d0
> [ 299.643261] io_schedule+0x96/0x100
> [ 299.643494] bit_wait_io+0x17/0xf0
> [ 299.643703] __wait_on_bit_lock+0x118/0x1a0
> [ 299.643941] ? __pfx_bit_wait_io+0x10/0x10
> [ 299.644195] out_of_line_wait_on_bit_lock+0xe8/0x120
> [ 299.644475] ? __pfx_out_of_line_wait_on_bit_lock+0x10/0x10
> [ 299.644793] ? __pfx_wake_bit_function+0x10/0x10
> [ 299.645060] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 299.645344] __lock_buffer+0x72/0x80
> [ 299.645569] do_get_write_access+0x7d0/0x1130
> [ 299.645838] jbd2_journal_get_write_access+0x1e9/0x270
> [ 299.646136] __ext4_journal_get_write_access+0x72/0x3b0
> [ 299.646436] ext4_reserve_inode_write+0x145/0x280
>
> So even with e5dd410acb34 reverted, we see a problem!
>
>
> The difference seems to be that the stack trace before and after e5dd410acb34
> is slightly different.
> Before e5dd410acb34 the stack trace often looks like one of the two above.
> After e5dd410acb34 the stack trace instead always has blk_mq_get_tag (and
> seems to be easier to reproduce).
>
>
> I think that the stack trace that contains "blk_mq_get_tag" actually indicates
> a new problem though, so thank you for that!
>
> I have a fix for it:
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index fa41ea57a978..3b303d4ae37a 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -651,6 +651,7 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
>  		/* the scmd has an associated qc */
>  		if (!(qc->flags & ATA_QCFLAG_EH)) {
>  			/* which hasn't failed yet, timeout */
> +			set_host_byte(scmd, DID_TIME_OUT);
>  			qc->err_mask |= AC_ERR_TIMEOUT;
>  			qc->flags |= ATA_QCFLAG_EH;
>  			nr_timedout++;
>
>
> I will post it as a real fix for the new problem (blk_mq_get_tag) tomorrow.
>

I applied this fix on top of linux v6.12-rc4. The new problem cannot be
reproduced. If possible, please help add a Reported-by tag.

> This just solves the new problem though.
>
> You probably need to do another bisection to find the problem that exists on
> v5.11 and earlier kernels.

Thanks for the analysis. It indeed gives me direction to further improve the
issue reproduction and bisection handling flow. I will follow up.
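
For example, I am considering making the dmesg classification in the bisect
flow key on the generic hung-task watchdog messages instead of only on
"blk_mq_get_tag", so that logs like the rq_qos_wait and jbd2 hangs you pointed
out above are no longer classified as "good". Below is only a rough sketch for
discussion, not the actual script; the marker strings, symptom list, and file
handling are simplified assumptions on my side:

#!/usr/bin/env python3
# Hypothetical sketch of a stricter dmesg classification step for the bisect
# script: treat any hung-task watchdog report as "bad", not only the
# "blk_mq_get_tag" signature, and record which known symptom matched.
import sys

# Generic markers printed by the kernel hung-task watchdog.
HUNG_TASK_MARKERS = ("INFO: task hung", "blocked for more than")

# Distinct symptoms seen in this thread, used only to label the failure.
SYMPTOMS = ("blk_mq_get_tag", "rq_qos_wait", "jbd2_journal_get_write_access")

def classify(dmesg_path):
    with open(dmesg_path, errors="replace") as f:
        log = f.read()
    if not any(marker in log for marker in HUNG_TASK_MARKERS):
        return "good"
    found = [sym for sym in SYMPTOMS if sym in log]
    return "bad: " + (", ".join(found) if found else "unclassified hung task")

if __name__ == "__main__":
    print(classify(sys.argv[1]))

With something like this, the v5.11 log above should be classified as
"bad: rq_qos_wait" instead of passing, which should help the next bisection
land on the older problem rather than on e5dd410acb34.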

Regards,
Yi Lai

>
>
> Kind regards,
> Niklas