Re: smart short test crashes software raid array?


    Oh, and BTW, you say you are losing two spindles, right? From your response, I loosely infer that it is either the two Seagate drives or the two Western Digital drives that are being dropped. Is that correct? If so, doesn't that suggest something to you?

    On a different note, you don't mention what RAID level you are running, but given the fact the array croaks when two drives are removed, I take it the array is RAID5 or less?  I suggest at least RAID6 and I would also suggest a hot spare.
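
    To make the RAID5 versus RAID6 point concrete, here is a minimal Python sketch of how many member failures each common md level tolerates and how much usable space it leaves. The function names and tables are illustrative assumptions, not anything from mdadm, and the 4 x 2 TB figures simply mirror the four 2TB drives described further down the thread. RAID10 is left out because its tolerance depends on which members fail.

```python
# Failures tolerated without data loss, as a function of member count n.
TOLERATED = {
    "raid0": lambda n: 0,
    "raid1": lambda n: n - 1,   # every member is a full mirror
    "raid5": lambda n: 1,       # single parity
    "raid6": lambda n: 2,       # double parity
}

# How many members' worth of space actually holds data.
DATA_MEMBERS = {
    "raid0": lambda n: n,
    "raid1": lambda n: 1,
    "raid5": lambda n: n - 1,
    "raid6": lambda n: n - 2,
}

def survives(level, members, failed):
    """True if the array keeps its data with `failed` members gone."""
    return failed <= TOLERATED[level](members)

def usable_tb(level, members, size_tb):
    """Usable capacity in TB for equally sized members."""
    return DATA_MEMBERS[level](members) * size_tb

for level in ("raid5", "raid6"):
    print(level, survives(level, 4, 2), usable_tb(level, 4, 2))
```

    With four 2 TB members and two of them dropped, this prints "raid5 False 6" and "raid6 True 4": RAID6 gives up 2 TB of space in exchange for surviving the double drop-out described here.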

On 3/9/2019 8:41 AM, Leslie Rhorer wrote:
    I think you may have used the term "driver" twice when you meant to say "drive".  The SMART operations are contained entirely within the hard drive hardware, so there is no direct interaction between RAID or any other host software and the SMART facilities, including the tests.  In a properly functioning hard drive, SMART operations should not cause an I/O call to time out.  I believe there are timeout traps in the ATA / SCSI I/O drivers, but I don't think there are any timeout traps in RAID.  I could be wrong on that point, however.

    RAID itself is not going to fail because of bandwidth limitations in the CPU or the I/O subsystems.  One could run RAID on a bunch of floppy disks, and it wouldn't care.  I am guessing you have a hardware issue.  Since the issue reportedly only shows up when SMART tests are running, I would suspect the hard drives themselves.

    Roger's suggestions are also reasonable.  I would start by backing up all the data.  I will say that again: back up all your data.  Then I would try swapping hardware, starting with the drives, and then the power supply.  How big is the power supply, by the way?  Then I would try drive cables, followed by the HBA / drive controller.  Are you using an external drive enclosure, or are the drives internal to the PC?

On 3/9/2019 3:43 AM, Kent Dorfman wrote:
Nothing useful in the dmesg, and "offline" means registered as failed
under /proc/mdstat, i.e. [UU..] instead of [UUUU].
Since the array was mounted at the time of failure, I just got read/write
errors when attempting to access files on it.
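
For anyone wanting to catch this state automatically, a rough Python sketch of spotting a degraded array in /proc/mdstat text follows. Note that the kernel marks a missing member with an underscore (for example [UU__]) rather than a dot. The sample text and the function name are made up for illustration and are not output from the machine discussed here.

```python
import re

# Illustrative sample only; real input would come from open("/proc/mdstat").read().
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdd1[3] sdc1[2] sdb1[1](F) sda1[0](F)
      5860147200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [__UU]

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return {array_name: status_string} for arrays with a missing member."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member map is the trailing bracket made only of 'U' and '_'.
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status and "_" in status.group(1):
            result[current] = status.group(1)
    return result

print(degraded_arrays(SAMPLE_MDSTAT))
```

A healthy array shows only U characters in that bracket and is therefore not reported.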

These are the drive IDs from my script that spawns smart test.

"ata-ST2000DL003-9VT166_5YD4TB1X"
"ata-ST2000DM001-1CH164_Z3408QTZ"
"ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338"
"ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V"

WD and Seagate 2TB drives.  Nothing special about them.

As far as low priority IO goes, if all four disks are doing the test
concurrently and the raid driver is trying to do IO, then it seems to
me that a raid operation timeout becomes a possibility.  If I
understand correctly, then the smart test is in the driver firmware
and is a scsi/ata command function sent to the drive.  Once it's
started, the host system is out of the loop, right?
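
One way to test the concurrency theory is to start the short tests one drive at a time instead of all at once. The following is only a sketch in Python: it assumes the IDs above resolve under /dev/disk/by-id/, that smartctl is on the PATH, and that a 300 second gap is long enough for a short test to finish (the gap is a guess, not a measured value). The function names are hypothetical.

```python
import subprocess
import time

DRIVE_IDS = [
    "ata-ST2000DL003-9VT166_5YD4TB1X",
    "ata-ST2000DM001-1CH164_Z3408QTZ",
    "ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338",
    "ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V",
]

def short_test_command(drive_id):
    """Build the smartctl invocation that starts a short self-test."""
    # Assumption: the IDs are names under /dev/disk/by-id/.
    return ["smartctl", "-t", "short", "/dev/disk/by-id/" + drive_id]

def run_staggered(drive_ids, gap_seconds=300,
                  runner=subprocess.run, sleeper=time.sleep):
    """Start one short test per drive, waiting gap_seconds between starts."""
    started = []
    for index, drive_id in enumerate(drive_ids):
        if index > 0:
            sleeper(gap_seconds)  # let the previous drive finish first
        cmd = short_test_command(drive_id)
        runner(cmd, check=False)  # smartctl returns as soon as the test starts
        started.append(cmd)
    return started
```

Calling run_staggered(DRIVE_IDS) from the cron job in place of the backgrounded loop means only one drive is testing at any moment. If the drop-outs stop, that points at the combined load; if they continue, the hardware is the more likely culprit.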



On 3/9/19, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
On Sat, 9 Mar 2019 01:09:18 -0500
Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:

On two occasions I have experienced two of four slices in a four disk
array going offline in the wee hours of the morning while doing heavy
batch processing and concurrently running smart short tests on the
drives via crontab.

My crontab consecutively starts the short test on all drives, and
since it backgrounds them, they all are running concurrently.

I suspect that the linux implementation of software raid is not able
to handle the heavy disk IO while all the drives are also doing their
smart test.

The first time this happened the first two stripes went offline, and
the second time the other two went offline.  Luckily on both occasions
I was able to reassemble the array (eight hour process) and did not
lose any data.

It will be difficult for anyone to guess what happened without a
corresponding complete dmesg output to define what "crashes" or "went
offline" meant exactly. It would also help to know the drive models. For
instance, SMR-based drives are known to have a hard time if you try to put
a ton of concurrent load on them.

A SMART test by itself shouldn't cause issues, as it's low priority IO for
the drive, so in theory the host shouldn't even be aware that it is going on.

--
With respect,
Roman

