Re: smart short test crashes software raid array?

Leslie Rhorer <lrhorer@xxxxxxxxxxxx> · Sat, 9 Mar 2019 09:54:13 -0600

    Oh, and BTW, you say you are losing two spindles, right? From your
response, I loosely infer it is either the two Seagate drives or else
the two Western Digital drives that are being dropped.  Is that
correct?  If so, doesn't that suggest something to you?

    On a different note, you don't mention what RAID level you are 
running, but given the fact the array croaks when two drives are 
removed, I take it the array is RAID5 or less?  I suggest at least RAID6 
and I would also suggest a hot spare.
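
The difference between the levels can be put in numbers. The following is only a rough sketch: the RAID level actually in use is not stated in this thread, the four 2 TB drives are taken from the drive list quoted further down, and the function names are made up for illustration.

```python
# Illustrative sketch: guaranteed fault tolerance and usable capacity
# for common md RAID levels. The function names are hypothetical.

def failures_tolerated(level: str, n_drives: int) -> int:
    """Simultaneous drive failures the array is guaranteed to survive."""
    table = {
        "raid0": 0,
        "raid1": n_drives - 1,  # every member holds a full copy
        "raid5": 1,             # single parity
        "raid6": 2,             # double parity
    }
    if level not in table:
        raise ValueError(f"unsupported level: {level}")
    return table[level]


def usable_capacity_tb(level: str, n_drives: int, drive_tb: float) -> float:
    """Usable capacity in TB, assuming equally sized member drives."""
    data_drives = {
        "raid0": n_drives,
        "raid1": 1,
        "raid5": n_drives - 1,
        "raid6": n_drives - 2,
    }
    if level not in data_drives:
        raise ValueError(f"unsupported level: {level}")
    return data_drives[level] * drive_tb


# Compare RAID5 and RAID6 on four 2 TB drives.
for level in ("raid5", "raid6"):
    print(level, failures_tolerated(level, 4), usable_capacity_tb(level, 4, 2.0))
```

With four members, RAID6 survives a double drop-out like the one described, at the cost of one more drive's worth of capacity than RAID5.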

On 3/9/2019 8:41 AM, Leslie Rhorer wrote:
    I think you may have used the term "driver" twice when you meant 
to say, "Drive".  The SMART operations are all contained fully within 
the hard drive hardware, so there is no direct interaction between 
RAID or any other host software and the SMART facilities, including 
the tests.  In a properly functioning hard drive, SMART operations 
should not cause an I/O call timeout.  I believe there are timeout 
traps in the ATA / SCSI I/O drivers, but I don't think there are any 
timeout traps in RAID.  I could be wrong on that point, however.
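
On Linux, the command timeout at the SCSI/ATA layer is exposed per device in sysfs as `/sys/block/<dev>/device/timeout`, in seconds. A minimal sketch for reading it follows; the helper name `read_command_timeout` and its `sysfs_root` parameter are hypothetical, and a device name such as `sda` is only an example.

```python
from pathlib import Path
from typing import Optional


def read_command_timeout(device: str, sysfs_root: str = "/sys/block") -> Optional[int]:
    """Return the SCSI-layer command timeout in seconds for a block device.

    Returns None if the attribute is missing, unreadable, or not an integer.
    """
    path = Path(sysfs_root) / device / "device" / "timeout"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

Calling `read_command_timeout("sda")` on each array member shows how long the kernel waits for a command before it starts error handling on that drive.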

    RAID itself is not going to fail because of bandwidth
limitations of the CPU or the I/O subsystems.  One could run RAID on a
bunch of floppy disks, and it wouldn't care.  I am guessing you have a 
hardware issue.  Since the issue reportedly only shows up when SMART 
tests are running, I would suspect the hard drives themselves.

    Roger's suggestions are also reasonable.  I would start by backing 
up all the data.  I will say that again: back up all your data.  Then 
I would try swapping hardware, starting with the drives, and then the
power supply.  How big is the power supply, by the way?  Then I would
try drive cables, followed by the HBA / drive controller.  Are you
using an external drive enclosure, or are the drives internal to the PC?

On 3/9/2019 3:43 AM, Kent Dorfman wrote:
Nothing useful in the dmesg, and offline means registered as failed
under /proc/mdstat, i.e. [UU__] instead of [UUUU].
Since the array was mounted at the time of failure, I just got
read/write errors when attempting to access files on it.
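
In /proc/mdstat a missing or failed member shows up as an underscore in the bracketed status string, so this state can be detected from a script. What follows is a rough sketch only: the function name is hypothetical, and the sample text is invented for illustration, not output captured from the machine discussed here.

```python
import re

# Hypothetical sample in the style of /proc/mdstat (not real output from
# the system in this thread).
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdd1[3] sdc1[2] sdb1[1](F) sda1[0](F)
      5860147200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [__UU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text: str) -> dict:
    """Map array name -> status string for arrays with a missing member."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status string consists only of 'U' (up) and '_' (missing).
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            result[current] = status.group(1)
    return result


print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live system the text would come from reading /proc/mdstat; running such a check from cron right after each SMART run would pin down when the members drop out.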

These are the drive IDs from my script that spawns the SMART tests.

"ata-ST2000DL003-9VT166_5YD4TB1X"
"ata-ST2000DM001-1CH164_Z3408QTZ"
"ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338"
"ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V"

WD and Seagate 2TB drives.  Nothing special about them.

As far as low priority IO goes, if all four disks are doing the test
concurrently and the raid driver is trying to do IO, then it seems to
me that a raid operation timeout becomes a possibility.  If I
understand correctly, then the SMART test is in the driver firmware
and is a SCSI/ATA command function sent to the drive.  Once it's
started, the host system is out of the loop, right?
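
One way to test the concurrency theory would be to start the short tests one drive at a time rather than all at once. This is a minimal sketch, not the crontab script referred to above: it assumes `smartctl` from smartmontools is installed, that the IDs listed are names under /dev/disk/by-id/, and that a five-minute gap is long enough for a short test to finish. The function names and the `gap_seconds`, `run` and `sleep` parameters are made up for illustration.

```python
import subprocess
import time

# Drive IDs as listed above; the /dev/disk/by-id/ location is an assumption.
DRIVE_IDS = [
    "ata-ST2000DL003-9VT166_5YD4TB1X",
    "ata-ST2000DM001-1CH164_Z3408QTZ",
    "ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338",
    "ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V",
]


def smart_short_command(drive_id: str) -> list:
    """Build the smartctl command that starts a short self-test."""
    return ["smartctl", "-t", "short", f"/dev/disk/by-id/{drive_id}"]


def run_staggered(drive_ids, gap_seconds=300, run=subprocess.run, sleep=time.sleep):
    """Start a short self-test on each drive, pausing between drives.

    smartctl returns as soon as the drive accepts the test, so the pause
    (an assumed value) is what keeps the tests from overlapping.
    """
    for index, drive_id in enumerate(drive_ids):
        if index:
            sleep(gap_seconds)
        run(smart_short_command(drive_id), check=False)
```

Calling `run_staggered(DRIVE_IDS)` as root from a single cron entry would replace the four backgrounded invocations. If the drives still drop out with the tests serialized, concurrency is probably not the cause.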

On 3/9/19, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
On Sat, 9 Mar 2019 01:09:18 -0500
Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:

On two occasions I have experienced two of four slices in a four disk
array going offline in the wee hours of the morning while doing heavy
batch processing and concurrently running smart short tests on the
drives via crontab.

My crontab consecutively starts the short test on all drives, and
since it backgrounds them, they all are running concurrently.

I suspect that the linux implementation of software raid is not able
to handle the heavy disk IO while all the drives are also doing their
smart test.

The first time this happened the first two stripes went offline, and
the second time the other two went offline.  Luckily on both occasions
I was able to reassemble the array (eight hour process) and did not
lose any data.

It will be difficult for anyone to guess what happened without a
corresponding complete dmesg output to define what "crashes" or "went
offline" meant exactly, and also what the drive models are. For
instance, SMR-based drives are known to have a hard time if you try to
put a ton of concurrent load on them.

A SMART test by itself shouldn't cause issues, as it's a low priority
IO for the drive, so in theory the host shouldn't even be aware that it
is going on.

--
With respect,
Roman
