How to debug intermittently increasing md/inflight but no disk activity?

Dear Linux folks,


We export directories over NFS from a Dell PowerEdge R420 running Linux 5.15.86, and users noticed intermittent hangs. For example,

    df /project/something # on an NFS client

timed out.

    @grele:~$ more /proc/mdstat
    Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
    md3 : active raid6 sdr[0] sdp[11] sdx[10] sdt[9] sdo[8] sdw[7] sds[6] sdm[5] sdu[4] sdq[3] sdn[2] sdv[1]
          156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
          bitmap: 0/117 pages [0KB], 65536KB chunk

    md2 : active raid6 sdap[0] sdan[11] sdav[10] sdar[12] sdam[8] sdau[7] sdaq[6] sdak[5] sdas[4] sdao[3] sdal[2] sdat[1]
          156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
          bitmap: 0/117 pages [0KB], 65536KB chunk

    md1 : active raid6 sdb[0] sdl[11] sdh[10] sdd[9] sdk[8] sdg[7] sdc[6] sdi[5] sde[4] sda[3] sdj[2] sdf[1]
          156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
          bitmap: 2/117 pages [8KB], 65536KB chunk

    md0 : active raid6 sdaj[0] sdz[11] sdad[10] sdah[9] sdy[8] sdac[7] sdag[6] sdaa[5] sdae[4] sdai[3] sdab[2] sdaf[1]
          156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
          bitmap: 7/117 pages [28KB], 65536KB chunk

    unused devices: <none>

During that time, we noticed that all 64 NFSD processes were in uninterruptible sleep and that the number of I/O requests in flight for the RAID6 device *md0* kept increasing

    /sys/devices/virtual/block/md0/inflight : 10 921

(10 reads, 921 writes in flight), but there was no disk activity according to iostat. As far as we saw, there was only little NFS activity going on. This came and went for around half an hour, and then we decreased the number of NFS server threads from 64 to 8. After a while the problem settled, meaning the in-flight I/O requests went down, so it might be related to the access pattern, but we would be curious to figure out exactly what is going on.
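While such a hang lasts, the kernel stacks of the blocked tasks would probably tell us where they are waiting. A minimal sketch to collect them (assuming root, as /proc/<pid>/stack is root-only):

    #!/bin/sh
    # Dump the kernel stack of every task currently in
    # uninterruptible sleep (state D), e.g. the stuck nfsd threads.
    for pid in /proc/[0-9]*; do
        state=$(awk '/^State:/ { print $2 }' "$pid/status" 2>/dev/null)
        if [ "$state" = D ]; then
            printf '=== %s (%s) ===\n' "${pid#/proc/}" "$(cat "$pid/comm")"
            cat "$pid/stack"
        fi
    done

That should at least show whether the nfsd threads are stuck in the md layer, the filesystem, or somewhere else.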

We captured some more data from sysfs [1].

Of course, it is not reproducible on demand, but any insight into how to debug this the next time it happens is much welcomed.
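For reference, here is what we are thinking of trying on the next occurrence; a sketch, assuming SysRq and the hung-task detector (khungtaskd) are enabled in the kernel:

    # Dump stack traces of all blocked (D state) tasks to the kernel log.
    echo w > /proc/sysrq-trigger

    # Let khungtaskd warn about tasks stuck in D state for more
    # than 60 seconds instead of the default 120.
    echo 60 > /proc/sys/kernel/hung_task_timeout_secs

    # Sample the in-flight counters (reads writes) and the RAID6
    # stripe cache usage once per second, next to iostat.
    while sleep 1; do
        printf '%s inflight: %s stripe_cache_active: %s\n' \
            "$(date +%T)" \
            "$(cat /sys/block/md0/inflight)" \
            "$(cat /sys/block/md0/md/stripe_cache_active)"
    done

If stripe_cache_active were to sit at stripe_cache_size while inflight climbs, that might point at the RAID6 layer rather than at nfsd itself.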


Kind regards,

Paul


[1]: https://owww.molgen.mpg.de/~pmenzel/grele.2.txt



