Re: How to debug intermittent increasing md/inflight but no disk activity?

How long does it freeze this way?

Disks developing bad blocks do show up as activity stopping for 3-60
seconds (depending on the disk's internal settings).

smartctl --xall <device> | grep -iE 'sector|reall' should show the
reallocation counters.

What kind of disks does the machine have?

On my home machine a bad sector freezes it for 7 seconds (scterc
defaults to 7 seconds).  On some big RAIDs with large disks at work
the hang is minutes: the raw disks are set to 10 seconds (that is
what the vendor told us), and that 10 seconds plus a potentially
large pile of I/Os queued against the bad sector shows up as minutes.
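
For reference, the SCT ERC timeout can be shown and changed with
smartctl (the values are in tenths of a second); /dev/sdX below is
just a placeholder for one of your member disks:

    # show the current read/write error recovery timeouts
    smartctl -l scterc /dev/sdX

    # set both read and write recovery to 7 seconds (70 deciseconds)
    smartctl -l scterc,70,70 /dev/sdX

Most drives forget this setting on a power cycle, so it usually has
to be reapplied at boot.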

I wrote a script that we use at work that both times how long
smartctl takes for each disk (a bad disk takes >5 seconds, and up to
minutes) and records the reallocated counters, saving a copy every
hour so one can see which disk incremented its counter in the last
hour and replace that disk.
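
Roughly, it is something like the sketch below (simplified; the log
location and the device glob are placeholders, not what we actually
use):

    #!/bin/sh
    # Hourly check: time smartctl for each disk and record the
    # sector/reallocation counters so an increment can be spotted later.
    logdir=/var/log/smart-realloc            # placeholder path
    mkdir -p "$logdir"
    log="$logdir/$(date +%Y%m%d-%H%M).log"

    for dev in /dev/sd[a-z] /dev/sd[a-z][a-z]; do
        [ -b "$dev" ] || continue            # skip unmatched globs
        start=$(date +%s)
        counters=$(smartctl --xall "$dev" | grep -iE 'sector|reall')
        end=$(date +%s)
        # A healthy disk answers in well under 5 seconds; a disk
        # hitting bad sectors can take much longer.
        printf '%s: %ss\n%s\n\n' "$dev" "$((end - start))" "$counters" >> "$log"
    done

Run it from cron every hour; diffing consecutive log files shows
which disk incremented its counter and how slow it has become.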

On Wed, Jul 10, 2024 at 6:46 AM Paul Menzel <pmenzel@xxxxxxxxxxxxx> wrote:
>
> Dear Linux folks,
>
>
> Exporting directories over NFS on a Dell PowerEdge R420 with Linux
> 5.15.86, users noticed intermittent hangs. For example,
>
>      df /project/something # on an NFS client
>
> on a different system timed out.
>
>      @grele:~$ more /proc/mdstat
>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>      md3 : active raid6 sdr[0] sdp[11] sdx[10] sdt[9] sdo[8] sdw[7]
> sds[6] sdm[5] sdu[4] sdq[3] sdn[2] sdv[1]
>            156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm
> 2 [12/12] [UUUUUUUUUUUU]
>            bitmap: 0/117 pages [0KB], 65536KB chunk
>
>      md2 : active raid6 sdap[0] sdan[11] sdav[10] sdar[12] sdam[8]
> sdau[7] sdaq[6] sdak[5] sdas[4] sdao[3] sdal[2] sdat[1]
>            156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm
> 2 [12/12] [UUUUUUUUUUUU]
>            bitmap: 0/117 pages [0KB], 65536KB chunk
>
>      md1 : active raid6 sdb[0] sdl[11] sdh[10] sdd[9] sdk[8] sdg[7]
> sdc[6] sdi[5] sde[4] sda[3] sdj[2] sdf[1]
>            156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm
> 2 [12/12] [UUUUUUUUUUUU]
>            bitmap: 2/117 pages [8KB], 65536KB chunk
>
>      md0 : active raid6 sdaj[0] sdz[11] sdad[10] sdah[9] sdy[8] sdac[7]
> sdag[6] sdaa[5] sdae[4] sdai[3] sdab[2] sdaf[1]
>            156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm
> 2 [12/12] [UUUUUUUUUUUU]
>            bitmap: 7/117 pages [28KB], 65536KB chunk
>
>      unused devices: <none>
>
> In that time, we noticed all 64 NFSD processes being in uninterruptible
> sleep and the I/O requests currently in process increasing for the RAID6
> device *md0*
>
>      /sys/devices/virtual/block/md0/inflight : 10 921
>
> but with no disk activity according to iostat. There was only “little
> NFS activity” going on as far as we saw. This alternated for around half
> an hour, and then we decreased the NFS processes from 64 to 8. After a
> while the problem settled, meaning the I/O requests went down, so it
> might be related to the access pattern, but we’d be curious to figure
> out exactly what is going on.
>
> We captured some more data from sysfs [1].
>
> Of course it’s not reproducible, but any insight how to debug this next
> time is much welcomed.
>
>
> Kind regards,
>
> Paul
>
>
> [1]: https://owww.molgen.mpg.de/~pmenzel/grele.2.txt
>




