Dmitry Vyukov wrote:
> On Tue, Jun 19, 2018 at 4:10 PM, Tetsuo Handa
> <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
> > On 2018/06/19 20:53, Dmitry Vyukov wrote:
> >> On Tue, Jun 19, 2018 at 1:44 PM, Tetsuo Handa
> >> <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
> >>> This bug report is getting no feedback, but I guess that this bug is in
> >>> the block or mm or locking layer rather than the fs layer.
> >>>
> >>> NMI backtrace for this bug tends to report that a stalling sb_bread()
> >>> from fill_super() from mount_bdev() is the cause of holding
> >>> s_umount_key for more than 120 seconds. What is strange is that NMI
> >>> backtrace for this bug tends to point at rcu_read_lock()/
> >>> pagecache_get_page()/radix_tree_deref_slot()/rcu_read_unlock(),
> >>> which is expected not to stall.
> >>>
> >>> Since CONFIG_RCU_CPU_STALL_TIMEOUT is set to 120 (actually +5 due to
> >>> CONFIG_PROVE_RCU=y), which is longer than
> >>> CONFIG_DEFAULT_HUNG_TASK_TIMEOUT, maybe setting
> >>> CONFIG_RCU_CPU_STALL_TIMEOUT to a smaller value (e.g. 25) can give
> >>> us some hints...
> >>
> >> If an RCU stall is the true root cause of this, then I guess we would
> >> see an "rcu stall" bug too. An RCU stall is detected after 120
> >> seconds, but a task hang after 120-240 seconds. So an RCU stall has a
> >> much higher chance of being detected. Do you see the corresponding
> >> "rcu stall" bug?
> >
> > RCU stall is detected after 125 seconds due to CONFIG_PROVE_RCU=y
> > (e.g. https://syzkaller.appspot.com/bug?id=1fac0fd91219f3f2a03d6fa7deafc95fbed79cc2 ).
> >
> > I didn't find the corresponding "rcu stall" bug. But it is not
> > required that one RCU stall takes longer than 120 seconds:
> >
> >   down();            // Will take 120 seconds due to multiple RCU stalls
> >   rcu_read_lock();
> >   do_something();
> >   rcu_read_unlock(); // Took 30 seconds for unknown reason.
> >   rcu_read_lock();
> >   do_something();
> >   rcu_read_unlock(); // Took 30 seconds for unknown reason.
> >   rcu_read_lock();
> >   do_something();
> >   rcu_read_unlock(); // Took 30 seconds for unknown reason.
> >   rcu_read_lock();
> >   do_something();
> >   rcu_read_unlock(); // Took 30 seconds for unknown reason.
> >   up();
>
> You think this is another false positive? Like this one
> https://github.com/google/syzkaller/issues/516#issuecomment-395685629 ?

According to https://syzkaller.appspot.com/text?tag=CrashLog&x=11db16c4400000
from "INFO: rcu detected stall in __process_echoes":

[  859.630022] INFO: rcu_sched self-detected stall on CPU
[  859.635509]  0-....: (124999 ticks this GP) idle=30e/1/4611686018427387906 softirq=287964/287964 fqs=31234
[  859.645716]   (t=125000 jiffies g=156333 c=156332 q=555)
(...snipped...)
[  860.266660]  ? process_one_work+0x1ba0/0x1ba0
[  860.271135]  ? kthread_bind+0x40/0x40
[  860.274927]  ret_from_fork+0x3a/0x50
[  861.152252] INFO: task kworker/u4:2:59 blocked for more than 120 seconds.
[  861.159245]       Not tainted 4.18.0-rc1+ #109
[  861.163851] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

the RCU stall was reported immediately before khungtaskd fired. Since
syzbot assigns only 2 CPUs, it might not be a rare case that a hung
process was unable to run simply because somebody else kept occupying
CPU resources.

Well, "BUG: soft lockup in __process_echoes" will be a dup of
"INFO: rcu detected stall in __process_echoes". I wonder why the
softlockup detector waited for 135 seconds. "BUG: soft lockup in
shrink_dcache_parent (2)", "BUG: soft lockup in snd_virmidi_output_trigger",
"BUG: soft lockup in smp_call_function_many" and "BUG: soft lockup in
do_raw_spin_unlock (2)" likewise waited for 134 seconds, while "BUG: soft
lockup in d_walk" waited for only 22 seconds...

Anyway, I think that in some cases RCU stalls/soft lockups are the cause
of hung tasks.