Re: looking for assistance with jbd2 (and other processes) hung trying to write to disk

On 11/10/2020 5:42 AM, Jan Kara wrote:
> On Mon 09-11-20 15:11:58, Chris Friesen wrote:
>> Can anyone give some suggestions on how to track down what's causing the
>> delay here?  I suspect there's a race condition somewhere similar to what
>> happened with https://access.redhat.com/solutions/3226391, although that one
>> was specific to device-mapper and the root filesystem here is directly on
>> the nvme device.
>
> Sadly I don't have access to RH portal to be able to check what that hang
> was about...

They had exactly the same stack trace (different addresses) with "jbd2/dm-16-8" trying to commit a journal transaction. In their case it was apparently due to two problems, "the RAID1 code leaking the r1bio", and "dm-raid not handling a needed retry scenario". They fixed it by backporting upstream commits. The kernel we're running should have those fixes, and in our case we're operating directly on an nvme device.

>> crash> ps -m 930
>> [0 00:09:11.694] [UN]  PID: 930    TASK: ffffa14b5f9032c0  CPU: 1 COMMAND: "jbd2/nvme2n1p4-"
>
> Are the tasks below the only ones hanging in D state (UN state in crash)?
> Because I can see processes are waiting for the locked buffer but it is
> unclear who is holding the buffer lock...

No, there are quite a few of them. I've included them below. I agree, it's not clear who's holding the lock. Is there a way to find that out?

Just to be sure, I'm looking for whoever has the BH_Lock bit set on the buffer_head "b_state" field, right? I don't see any ownership field the way we have for mutexes. Is there some way to find out who would have locked the buffer?
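
For reference, here is a minimal sketch of how I'd decode a raw "b_state" value once it has been read out of the dump (for example with crash's "struct buffer_head.b_state <address>"). The bit positions are my reading of enum bh_state_bits in include/linux/buffer_head.h and only the first four bits are listed; they should be checked against the source of the kernel that produced the dump. The helper names are made up for illustration.

```python
# Bit positions assumed from enum bh_state_bits in include/linux/buffer_head.h.
# Only the first four bits are listed here; verify them (and any later bits)
# against the exact kernel source that produced the crash dump.
BH_BITS = {
    0: "BH_Uptodate",  # buffer contains valid data
    1: "BH_Dirty",     # buffer is dirty
    2: "BH_Lock",      # buffer is locked (a bit lock, no owner is recorded)
    3: "BH_Req",       # buffer has been submitted for I/O
}


def decode_b_state(b_state):
    """Return the names of the known flag bits set in a b_state value."""
    return [name for bit, name in sorted(BH_BITS.items()) if b_state & (1 << bit)]


def is_locked(b_state):
    """True if the BH_Lock bit (assumed to be bit 2) is set."""
    return bool(b_state & (1 << 2))


if __name__ == "__main__":
    value = 0x5  # hypothetical b_state value, not taken from this dump
    print(decode_b_state(value), is_locked(value))
```

This only confirms that the buffer is locked. Since the lock is a single bit in "b_state", the value itself says nothing about which task or I/O path set it.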

Do you think it would help at all to enable CONFIG_JBD2_DEBUG?

Processes in "UN" state in crashdump:

crash> ps|grep UN
      1      0   1  ffffa14b687d8000  UN   0.0  193616   6620  systemd
    930      2   1  ffffa14b5f9032c0  UN   0.0       0      0  [jbd2/nvme2n1p4-]
   1489      2   1  ffffa14b641f0000  UN   0.0       0      0  [jbd2/dm-0-8]
   1494      2   1  ffffa14b641f2610  UN   0.0       0      0  [jbd2/dm-11-8]
   1523      2   1  ffffa14b64182610  UN   0.0       0      0  [jbd2/dm-1-8]
   1912      1   1  ffffa14b62dc2610  UN   0.0  117868  17568  syslog-ng
  86293      1   1  ffffa14ae4650cb0  UN   0.1 4618100 116664  containerd
  86314      1   1  ffffa14ae2639960  UN   0.1 4618100 116664  containerd
  88019      1   1  ffffa14ae26ad8d0  UN   0.2  651196 210260  safe_timer
  90539      1   1  ffffa13caca3bf70  UN   0.0   25868   2140  fsmond
  94006  93595   1  ffffa14ae31fe580  UN   0.1 13843140 113604  etcd
  95061  93508   1  ffffa14a913e8cb0  UN   0.1  721888 114652  log
  96367      1   1  ffffa14af53f9960  UN   0.0  119404  19084  python
  121292      1   1  ffffa14ae18932c0  UN   0.1 4618100 116664  containerd
  122042       1   1  ffffa14a950a6580  UN   0.0  111680   9496  containerd-shim
  126119  122328  23  ffffa14b3d76a610  UN   0.0       0      0  com.xcg
  126171  122328  47  ffffa14a91571960  UN   0.0       0      0  com.xcg
  126173  122328  23  ffffa14a91573f70  UN   0.0       0      0  com.xcg
  126177  122328  23  ffffa14a91888000  UN   0.0       0      0  com.xcg
  128049  124763  47  ffffa14a964e6580  UN   0.1 1817292  80388  confd
  136938  136924   1  ffffa14b5bb7d8d0  UN   0.0  146256  25672  coredns
  136972  136924   1  ffffa14a9aae2610  UN   0.0  146256  25672  coredns
  136978  136924   1  ffffa14ae2238000  UN   0.0  146256  25672  coredns
  143026  142739   1  ffffa14b035e0000  UN   0.0       0      0  cainjector
  166456  165537  44  ffffa14af3cb8000  UN   0.0  325468  10736  nronmd.xcg
  166712  165537  44  ffffa149a2fecc20  UN   0.0  200116   3728  vpms.xcg
  166725  165537  44  ffffa14962fb6580  UN   0.1 2108336  58176  vrlcb.xcg
  166860  165537  45  ffffa14afd22bf70  UN   0.0  848320  12180  gcci.xcg
  166882  165537  45  ffffa14aff3c58d0  UN   0.0  693256  11624  ndc.xcg
  167556  165537  44  ffffa14929a6cc20  UN   0.0  119604   2612  gcdm.xcg
  170732  122328  23  ffffa1492987bf70  UN   0.0  616660   4348  com.xcg
  170741  122328  46  ffffa1492987cc20  UN   0.0       0      0  com.xcg
  170745  122328  23  ffffa1492987e580  UN   0.0       0      0  com.xcg
  170750  122328  23  ffffa14924d4f230  UN   0.0       0      0  com.xcg
  170774  122328  23  ffffa14924d4bf70  UN   0.0       0      0  com.xcg
  189870  187717  46  ffffa14873591960  UN   0.1  881516  83840  filebeat
  332649  136924   1  ffffa147efd49960  UN   0.0  146256  25672  coredns
  1036113  3779184  23  ffffa13c9317bf70  UN   0.9 6703644 878848  pool-3-thread-1
  1793349       2   1  ffffa14ae2402610  UN   0.0       0      0  [kworker/1:0]
  1850718  166101   0  ffffa14807448cb0  UN   0.0   18724   6068  exe
  1850727  1850722   0  ffffa147e18dd8d0  UN   0.0   18724   6068  exe
  1850733  120305   1  ffffa147e18da610  UN   0.0  135924   6512  runc
  1850792  128006  46  ffffa14ae1948cb0  UN   0.0   21716   1280  logrotate
  1850914  1850911   1  ffffa147086dbf70  UN   0.0   18724   6068  exe
  1851274  127909  46  ffffa14703661960  UN   0.0   53344   3232  redis-server
  1851474  1850787   1  ffffa1470026cc20  UN   0.0  115704   1244  ceph
  1853422  1853340  44  ffffa146dfdc1960  UN   0.0   12396   2312  sh
  1854005      1   1  ffffa146d7d8f230  UN   0.0  116872    812  mkdir
  1854955  2847282   1  ffffa146c5d18cb0  UN   0.0   18724   6068  exe
  1856515  166108   1  ffffa146aa071960  UN   0.0   18724   6068  exe
  1856602  84624   1  ffffa146aa073f70  UN   0.0  184416   1988  crond
  1859661  1859658   1  ffffa14672090000  UN   0.0  116872    812  mkdir
  2232051  165443   7  ffffa147e1ac0000  UN   0.0       0      0  eal-intr-thread
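
In case it helps anyone sift through a list like this, below is a rough sketch of a script that groups the "UN" rows by command name or by CPU. It assumes the nine-column layout shown above (PID, PPID, CPU, TASK, ST, %MEM, VSZ, RSS, COMM) and that the "ps" output has been saved as plain text; the function names are hypothetical.

```python
from collections import Counter


def parse_ps_line(line):
    """Parse one row of crash `ps` output into a dict, or return None.

    Assumes the column order PID PPID CPU TASK ST %MEM VSZ RSS COMM.
    """
    fields = line.split()
    if fields and fields[0] == ">":  # crash marks active tasks with ">"
        fields = fields[1:]
    if len(fields) < 9:
        return None
    try:
        return {
            "pid": int(fields[0]),
            "ppid": int(fields[1]),
            "cpu": int(fields[2]),
            "task": fields[3],
            "state": fields[4],
            "comm": " ".join(fields[8:]),
        }
    except ValueError:
        return None  # header line or anything else that is not a task row


def summarize_un(text, key="comm"):
    """Count tasks in UN state, grouped by the given field ("comm" or "cpu")."""
    counts = Counter()
    for line in text.splitlines():
        row = parse_ps_line(line)
        if row is not None and row["state"] == "UN":
            counts[row[key]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_un.py < ps_output.txt
    for name, count in summarize_un(sys.stdin.read()).most_common():
        print(count, name)
```

Grouping by "cpu" instead of "comm" gives a quick view of how the blocked tasks are spread across CPUs.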


Thanks,
Chris



