Re: looking for assistance with jbd2 (and other processes) hung trying to write to disk

Chris Friesen <chris.friesen@xxxxxxxxxxxxx> · Tue, 10 Nov 2020 14:14:26 -0600

On 11/10/2020 1:46 PM, Theodore Y. Ts'o wrote:
[Please note this e-mail is from an EXTERNAL e-mail address]

On Tue, Nov 10, 2020 at 09:57:39AM -0600, Chris Friesen wrote:

Just to be sure, I'm looking for whoever has the BH_Lock bit set on the
buffer_head "b_state" field, right?  I don't see any ownership field the way
we have for mutexes.  Is there some way to find out who would have locked
the buffer?

It's quite possible that the buffer was locked as part of doing I/O,
and we are just waiting for the I/O to complete.  An example of this
is in journal_submit_commit_record(), where we lock the buffer using
lock_buffer(), and then call submit_bh() to submit the buffer for I/O.
When the I/O is completed, the buffer head will be unlocked, and we
can check the buffer_uptodate flag to see if the I/O completed
successfully.  (See journal_wait_on_commit_record() for an example of
this.)

Running "ps -m 'jbd2'" in the crashdump shows jbd2/nvme2n1p4- in the 
uninterruptible state, with a "last run" timestamp of over 9 minutes 
before the crash.  Same for a number of jbd2/dm* tasks.  This seems like 
a very long time to wait for I/O to complete, which is why I'm assuming 
something's gone off the rails.

So the first thing I'd suggest doing is looking at the console output
or dmesg output from the crashdump to see if there are any clues in
terms of kernel messages from the device driver before things locked
up.  This could be as simple as the device falling off the bus, in
which case there might be some kernel error messages from the block
layer or device driver that would give some insight.

The timeline looks like this (CPUs 0,1,24,25 are the housekeeping CPUS):

The only device-related issue I see is this, just a bit over 9 minutes 
before the eventual panic.  Prior to this there are no crashdump dmesg 
logs for a couple hours previous.
[119982.636995] WARNING: CPU: 1 PID: 21 at net/sched/sch_generic.c:360 
dev_watchdog+0x268/0x280
[119982.636997] NETDEV WATCHDOG: mh0 (iavf): transmit queue 3 timed out

Then I see rcu_sched self-detecting stalls:
[120024.146369] INFO: rcu_sched self-detected stall on CPU { 25} 
(t=60000 jiffies g=10078853 c=10078852 q=250)
[120203.725976] INFO: rcu_sched self-detected stall on CPU { 25} 
(t=240003 jiffies g=10078853 c=10078852 q=361)
[120383.305584] INFO: rcu_sched self-detected stall on CPU { 25} 
(t=420006 jiffies g=10078853 c=10078852 q=401)

The actual panic is here:
[120536.886219] Kernel panic - not syncing: Software Watchdog Timer expired
[120536.886221] CPU: 1 PID: 21 Comm: ktimersoftd/1 Kdump: loaded 
Tainted: G        W  O   ------------ T 
3.10.0-1127.rt56.1093.el7.tis.2.x86_64 #1

Chris