Dear Roger,
Thank you for your reply.
On 10.07.24 at 13:54, Roger Heflin wrote:
> How long does it freeze this way?
It froze for up to five minutes, I’d say.
> Disks that are developing bad blocks show up as stopping activity for
> 3-60 seconds (depending on the disk's internal settings).
> smartctl --xall <device> | grep -iE 'sector|reall' should show the
> reallocation counters.
These are SAS disks, and none of the array members has any errors. Example:
```
@grele:~$ sudo smartctl --xall /dev/sdy
[…]
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     655487.372           0
write:         0        0         0         0          0      38289.771           0
```
> What kind of disks does the machine have?
Seagate ST16000NM004J (16 TB, SAS)
> On my home machine a bad sector freezes it for 7 seconds (scterc
> defaults to 7). On a large RAID with big disks at work, the hang
> lasts minutes. The raw disk is set to 10 (that is what the vendor
> told us), and that 10, combined with potentially a bunch of I/Os
> queued against the bad sector, shows up as minutes.
>
> I wrote a script, used at work, that times how long smartctl takes
> for each disk (the bad disk takes >5 seconds, and up to minutes),
> shows the reallocated count, and saves a copy every hour, so one can
> see which disk incremented its counter in the last hour and replace
> that disk.
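For reference, a script along the lines described above might look roughly like the following. This is only a minimal sketch, not the script from the quoted message: the function names, the snapshot directory and the one-line-per-disk snapshot format are assumptions, and the parsing assumes ATA-style `Reallocated_Sector_Ct` attribute lines. SAS disks report a grown defect list instead, so the pattern would have to be adapted for them.

```shell
#!/bin/sh
# Hypothetical sketch; names, paths and snapshot format are assumptions.
SNAPDIR="${SNAPDIR:-/var/tmp/smart-snapshots}"

# Read smartctl output on stdin and print the raw reallocation count
# (last field of the attribute line), or "n/a" if no such line exists.
realloc_count() {
    awk '/Reallocated_Sector_Ct/ { print $NF; found = 1; exit }
         END { if (!found) print "n/a" }'
}

# Print "<device> <seconds smartctl took> <reallocation count>".
# Uses "date +%s", so the timing has one-second resolution.
check_disk() {
    start=$(date +%s)
    out=$(smartctl --xall "$1" 2>&1)
    end=$(date +%s)
    count=$(printf '%s\n' "$out" | realloc_count)
    printf '%s %s %s\n' "$1" "$((end - start))" "$count"
}

# Write one snapshot file per hour for the given devices.
snapshot() {
    mkdir -p "$SNAPDIR"
    for dev in "$@"; do
        check_disk "$dev"
    done > "$SNAPDIR/$(date +%Y%m%d-%H)"
}

# Print devices whose count differs between an older ($1) and a
# newer ($2) snapshot file.
changed_disks() {
    awk 'NR == FNR { old[$1] = $3; next }
         ($1 in old) && $3 != old[$1] { print $1 }' "$1" "$2"
}
```

Run as root from an hourly cron job, something like `snapshot /dev/sd?` would collect the data, and `changed_disks` on the two most recent snapshot files would name the disks whose counter moved in the last hour. A disk for which the second column is well above a few seconds would be the other warning sign.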
A colleague also wrote a Perl program, diskcheck, which is run regularly
to check all the disks. It shows nothing suspicious here.
Kind regards,
Paul