Re: [Linux kernel bug] INFO: task hung in blk_mq_get_tag

Sam Sun <samsun1006219@xxxxxxxxx> · Tue, 14 May 2024 20:07:34 +0800

On Tue, May 14, 2024 at 6:37 PM Hillf Danton <hdanton@xxxxxxxx> wrote:
>
> On Tue, 14 May 2024 10:05:21 +0800 Sam Sun <samsun1006219@xxxxxxxxx>
> > On Tue, May 14, 2024 at 6:54 AM Hillf Danton <hdanton@xxxxxxxx> wrote:
> > > On Mon, 13 May 2024 20:57:44 +0800 Sam Sun <samsun1006219@xxxxxxxxx>
> > > >
> > > > I applied this patch and tried using the C repro, but it still crashed
> > > > with the same task hang kernel dump log.
> > >
> > > Oh low-hanging pear is sour, and try again seeing if there is missing
> > > wakeup due to wake batch.
> > >
> > > --- x/lib/sbitmap.c
> > > +++ y/lib/sbitmap.c
> > > @@ -579,6 +579,8 @@ void sbitmap_queue_wake_up(struct sbitma
> > >         unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
> > >         unsigned int wakeups;
> > >
> > > +       __sbitmap_queue_wake_up(sbq, nr);
> > > +
> > >         if (!atomic_read(&sbq->ws_active))
> > >                 return;
> > >
> > > --
> >
> > I applied this patch together with the last patch. Unfortunately it
> > still crashed.
>
> After two rounds of test, what is clear now so far is -- it is IOs
> in flight that caused the task hung reported, though without spotting
> why they failed to complete within 120 seconds.
> >
> > Pointed out by Tetsuo, this kernel panic might be caused by sending
> > NMI between cpus. As dump log shows:
> > ```
> > [  429.046960][   T32] NMI backtrace for cpu 0
> > [  429.047499][   T32] CPU: 0 PID: 32 Comm: khungtaskd Not tainted 6.9.0-dirty #6
> > [  429.047499][   T32] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > [  429.049873][   T32] Call Trace:
> > [  429.050299][   T32]  <TASK>
> > [  429.050672][   T32]  dump_stack_lvl+0x201/0x300
> > ...
> > [  429.063133][   T32]  ret_from_fork_asm+0x11/0x20
> > [  429.063735][   T32]  </TASK>
> > [  429.064168][   T32] Sending NMI from CPU 0 to CPUs 1:
> > [  429.064833][   T32] BUG: unable to handle page fault for address: ffffffff813d4cf1
>
> Given many syzbot reports without gpf like this one, I have difficulty
> understanding it. If it is printed after task hung detected, it should
> be a separate issue.
>

I tried to run

# echo 0 > /proc/sys/kernel/hung_task_all_cpu_backtrace

before running the reproducer, and the kernel stopped panicking. But
even after I terminate the reproducer, the kernel keeps dumping task
hung logs. After setting hung_task_all_cpu_backtrace back to 1, it
panicked immediately during the next dump. So I guess it is still a
task hang rather than a general protection fault.
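
In case it helps anyone repeat this check, below is a minimal sketch of
the toggle-and-restore step. The function name with_backtrace_disabled,
the KNOB override and the ./repro binary name are placeholders of mine
and not part of the original test setup; it has to run as root.

```shell
# Path of the sysctl knob; KNOB is a hypothetical override for dry runs.
knob="${KNOB:-/proc/sys/kernel/hung_task_all_cpu_backtrace}"

# Run a command with the all-CPU NMI backtrace disabled, then put the
# previous value back so later hung-task dumps behave as before.
with_backtrace_disabled() {
    old=$(cat "$knob") || return 1   # remember the current setting
    echo 0 > "$knob" || return 1     # khungtaskd will not send NMIs
    "$@"                             # run the given command
    rc=$?
    echo "$old" > "$knob"            # restore the saved setting
    return "$rc"
}

# Example (placeholder reproducer name):
# with_backtrace_disabled ./repro
```

With the knob at 0, khungtaskd still reports the hung task but skips the
NMI backtrace on the other CPUs, which is the step where the page fault
appears in the log above.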

> > [  429.065765][   T32] #PF: supervisor write access in kernel mode
> > [  429.066502][   T32] #PF: error_code(0x0003) - permissions violation
> > [  429.067274][   T32] PGD db38067 P4D db38067 PUD db39063 PMD 12001a1
> > [  429.068068][   T32] Oops: 0003 [#1] PREEMPT SMP KASAN NOPTI
> > [  429.068767][   T32] CPU: 0 PID: 32 Comm: khungtaskd Not tainted 6.9.0-dirty #6
> > [  429.069666][   T32] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > [  429.071142][   T32] RIP: 0010:__send_ipi_mask+0x541/0x690