Re: looking for assistance with jbd2 (and other processes) hung trying to write to disk

On 11/10/2020 1:46 PM, Theodore Y. Ts'o wrote:

> On Tue, Nov 10, 2020 at 09:57:39AM -0600, Chris Friesen wrote:
>
>> Just to be sure, I'm looking for whoever has the BH_Lock bit set on the
>> buffer_head "b_state" field, right?  I don't see any ownership field the way
>> we have for mutexes.  Is there some way to find out who would have locked
>> the buffer?

> It's quite possible that the buffer was locked as part of doing I/O,
> and we are just waiting for the I/O to complete.  An example of this
> is in journal_submit_commit_record(), where we lock the buffer using
> lock_buffer(), and then call submit_bh() to submit the buffer for I/O.
> When the I/O is completed, the buffer head will be unlocked, and we
> can check the buffer_uptodate flag to see if the I/O completed
> successfully.  (See journal_wait_on_commit_record() for an example of
> this.)
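
(For my own reference, the submit-and-wait pattern described above boils
down to roughly the following.  This is a loose sketch of the buffer_head
API rather than the actual jbd2 code; write_buffer_and_wait() is just an
illustrative name, and the two-argument submit_bh() form is the one in
3.10-era kernels.)

/*
 * Minimal sketch of the lock/submit/wait pattern, loosely modelled on
 * journal_submit_commit_record() / journal_wait_on_commit_record() in
 * fs/jbd2/commit.c.  Journal-specific details and most error handling
 * are omitted.
 */
#include <linux/buffer_head.h>

static int write_buffer_and_wait(struct buffer_head *bh)
{
	lock_buffer(bh);			/* sets BH_Lock in bh->b_state */
	get_bh(bh);				/* hold a ref across the I/O */
	bh->b_end_io = end_buffer_write_sync;	/* completion handler unlocks the buffer */

	submit_bh(WRITE, bh);			/* queue the write */

	wait_on_buffer(bh);			/* sleep until BH_Lock is cleared */
	if (!buffer_uptodate(bh))		/* I/O finished, but did it succeed? */
		return -EIO;
	return 0;
}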

Running "ps -m 'jbd2'" in the crashdump shows jbd2/nvme2n1p4- in the uninterruptible state, with a "last run" timestamp of over 9 minutes before the crash. Same for a number of jbd2/dm* tasks. This seems like a very long time to wait for I/O to complete, which is why I'm assuming something's gone off the rails.

> So the first thing I'd suggest doing is looking at the console output
> or dmesg output from the crashdump to see if there are any clues in
> terms of kernel messages from the device driver before things locked
> up.  This could be as simple as the device falling off the bus, in
> which case there might be some kernel error messages from the block
> layer or device driver that would give some insight.

The timeline looks like this (CPUs 0,1,24,25 are the housekeeping CPUs):

The only device-related issue I see is this, just over 9 minutes before the eventual panic. Prior to it there are no crashdump dmesg logs for a couple of hours:

[119982.636995] WARNING: CPU: 1 PID: 21 at net/sched/sch_generic.c:360 dev_watchdog+0x268/0x280
[119982.636997] NETDEV WATCHDOG: mh0 (iavf): transmit queue 3 timed out

Then I see rcu_sched self-detecting stalls:
[120024.146369] INFO: rcu_sched self-detected stall on CPU { 25} (t=60000 jiffies g=10078853 c=10078852 q=250)
[120203.725976] INFO: rcu_sched self-detected stall on CPU { 25} (t=240003 jiffies g=10078853 c=10078852 q=361)
[120383.305584] INFO: rcu_sched self-detected stall on CPU { 25} (t=420006 jiffies g=10078853 c=10078852 q=401)

The actual panic is here:
[120536.886219] Kernel panic - not syncing: Software Watchdog Timer expired
[120536.886221] CPU: 1 PID: 21 Comm: ktimersoftd/1 Kdump: loaded Tainted: G W O ------------ T 3.10.0-1127.rt56.1093.el7.tis.2.x86_64 #1


Chris


