Re: How to debug intermittent increasing md/inflight but no disk activity?

Paul Menzel <pmenzel@xxxxxxxxxxxxx> · Tue, 23 Jul 2024 12:33:33 +0200

Dear Roger,

Thank you for your reply.

On 10.07.24 at 13:54, Roger Heflin wrote:
How long does it freeze this way?

It froze for up to five minutes, I’d say.

Disks that develop bad blocks do show up as stopping activity for 3-60
seconds (depending on the disk's internal settings).

smartctl --xall <device> | grep -iE 'sector|reall' should show the
reallocation counters.

These are SAS disks, and none of the array members has any errors. Example:

```
@grele:~$ sudo smartctl --xall /dev/sdy
[…]
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     655487.372           0
write:         0        0         0         0          0      38289.771           0
```

What kind of disks does the machine have?

Seagate ST16000NM004J (16 TB, SAS)

On my home machine a bad sector freezes it for 7 seconds (scterc
defaults to 7).  On some large-disk, big RAID systems at work the hang
is minutes.  The raw disk is set to 10 (that is what the vendor told
us), and that 10, plus potentially a bunch of IOs queued against the
bad sector, shows up as minutes.
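
To make the arithmetic behind "10 plus a bunch of IOs shows up as minutes" concrete, here is a rough sketch. It assumes, and this is only an assumption, that every IO queued against the bad sector waits out the full per-IO timeout one after another. The function name and the figure of 30 queued IOs are hypothetical and chosen purely for illustration.

```python
def worst_case_stall_seconds(per_io_timeout_s, queued_ios):
    """Upper bound on the stall, assuming each queued IO waits out the
    full timeout and the IOs are handled strictly one after another."""
    return per_io_timeout_s * queued_ios


# Hypothetical numbers: a 10-second timeout and 30 IOs queued against
# the bad sector. The 30 is illustrative, not a measurement.
stall = worst_case_stall_seconds(10, 30)
print(f"{stall} s = {stall / 60:.0f} min")  # prints "300 s = 5 min"
```

Under that assumption a few dozen queued IOs would be enough to turn a 10-second timeout into a stall of several minutes.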

I wrote a script that work uses that both times how long smartctl
takes for each disk (the bad disk takes >5 seconds, and up to minutes)
and shows the reallocated count.  It saves a copy every hour, so one
can see which disk incremented its counter in the last hour and replace
that disk.

A colleague also wrote a Perl program, diskcheck, that is run regularly
to check all the disks. Nothing suspicious here.
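
For reference, the kind of check discussed above (time how long smartctl takes per disk, keep the counter lines, and compare them between hourly snapshots) might be sketched roughly as follows. This is not Roger's script and not diskcheck; all function names and the 5-second threshold parameter are made up for illustration. The regular expression is the one from the grep quoted above and may need adjusting for what SAS disks actually report.

```python
import re
import subprocess
import time

# Same pattern as the grep quoted in this thread.
COUNTER_RE = re.compile(r"sector|reall", re.IGNORECASE)


def run_timed(cmd):
    """Run cmd and return (elapsed_seconds, stdout_text)."""
    start = time.monotonic()
    # smartctl uses a bitmask exit status, so a non-zero code is not fatal here.
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return time.monotonic() - start, proc.stdout


def counter_lines(smartctl_output):
    """Return the stripped output lines that match COUNTER_RE."""
    return [line.strip() for line in smartctl_output.splitlines()
            if COUNTER_RE.search(line)]


def check_disk(device, slow_after=5.0):
    """Time `smartctl --xall` for one device and collect its counter lines.

    slow_after is a hypothetical threshold based on the ">5 seconds" remark.
    """
    elapsed, output = run_timed(["smartctl", "--xall", device])
    return {"device": device, "seconds": elapsed,
            "slow": elapsed > slow_after, "counters": counter_lines(output)}


def changed_devices(previous, current):
    """Compare two snapshots ({device: counter lines}) and return the
    devices whose counter lines differ between them."""
    return sorted(dev for dev, lines in current.items()
                  if dev in previous and previous[dev] != lines)
```

Run as root, something like `check_disk("/dev/sdy")` once an hour per array member, with the results stored per device, would give both the timing and the snapshots that `changed_devices` compares.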

Kind regards,

Paul