On 3/9/19, Leslie Rhorer <lrhorer@xxxxxxxxxxxx> wrote:
> Oh, and BTW, you say you are losing two spindles, right? From your
> response, I loosely infer it is either the two Seagate drives or else
> the two Western Digital drives that are being dropped. Is that
> correct? If so, doesn't that suggest something to you?
>
> On a different note, you don't mention what RAID level you are
> running, but given the fact the array croaks when two drives are
> removed, I take it the array is RAID5 or less? I suggest at least
> RAID6, and I would also suggest a hot spare.
>
> On 3/9/2019 8:41 AM, Leslie Rhorer wrote:
>> I think you may have used the term "driver" twice when you meant
>> to say "drive". The SMART operations are all contained fully within
>> the hard drive hardware, so there is no direct interaction between
>> RAID or any other host software and the SMART facilities, including
>> the tests. In a properly functioning hard drive, SMART operations
>> should not cause an I/O call timeout. I believe there are timeout
>> traps in the ATA / SCSI I/O drivers, but I don't think there are any
>> timeout traps in RAID. I could be wrong on that point, however.
>>
>> RAID itself is not going to be susceptible to failure from bandwidth
>> limitations of the CPU or the I/O subsystems. One could run RAID on a
>> bunch of floppy disks, and it wouldn't care. I am guessing you have a
>> hardware issue. Since the issue reportedly only shows up when SMART
>> tests are running, I would suspect the hard drives themselves.
>>
>> Roger's suggestions are also reasonable. I would start by backing
>> up all the data. I will say that again: back up all your data. Then
>> I would try swapping hardware, starting with the drives, and then the
>> power supply. How big is the power supply, by the way? Then I would
>> try drive cables, followed by the HBA / drive controller. Are you
>> using an external drive enclosure, or are the drives internal to the PC?
>>
>> On 3/9/2019 3:43 AM, Kent Dorfman wrote:
>>> Nothing useful in the dmesg, and "offline" means registered as
>>> failed under /proc/mdstat, i.e. [UU__] instead of [UUUU]. Since the
>>> array was mounted at the time of failure, there were just read/write
>>> errors when attempting to access files on it.
>>>
>>> These are the drive IDs from my script that spawns the SMART tests:
>>>
>>> "ata-ST2000DL003-9VT166_5YD4TB1X"
>>> "ata-ST2000DM001-1CH164_Z3408QTZ"
>>> "ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338"
>>> "ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V"
>>>
>>> WD and Seagate 2TB drives. Nothing special about them.
>>>
>>> As far as low-priority IO goes, if all four disks are doing the test
>>> concurrently and the RAID driver is trying to do IO, then it seems
>>> to me that a RAID operation timeout becomes a possibility. If I
>>> understand correctly, the SMART test runs in the drive firmware and
>>> is a SCSI/ATA command sent to the drive. Once it's started, the host
>>> system is out of the loop, right?
>>>
>>> On 3/9/19, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
>>>> On Sat, 9 Mar 2019 01:09:18 -0500
>>>> Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:
>>>>
>>>>> On two occasions I have experienced two of four slices in a
>>>>> four-disk array going offline in the wee hours of the morning
>>>>> while doing heavy batch processing and concurrently running SMART
>>>>> short tests on the drives via crontab.
>>>>>
>>>>> My crontab consecutively starts the short test on all drives, and
>>>>> since it backgrounds them, they all are running concurrently.
>>>>>
>>>>> I suspect that the Linux implementation of software RAID is not
>>>>> able to handle the heavy disk IO while all the drives are also
>>>>> doing their SMART test.
>>>>>
>>>>> The first time this happened, the first two stripes went offline,
>>>>> and the second time the other two went offline. Luckily, on both
>>>>> occasions I was able to reassemble the array (an eight-hour
>>>>> process) and did not lose any data.
>>>>
>>>> It will be difficult for anyone to guess what happened without a
>>>> corresponding complete dmesg output to define what "crashes" or
>>>> "went offline" meant exactly, and also what the drive models are.
>>>> For instance, SMR-based drives are known to have a hard time if you
>>>> try to put a ton of concurrent load on them.
>>>>
>>>> A SMART test by itself shouldn't cause issues, as it's low-priority
>>>> IO for the drive, so in theory the host shouldn't even be aware
>>>> that it is going on.
>>>>
>>>> --
>>>> With respect,
>>>> Roman

Internal RAID5, no room for a spare. The drives are staggered: sda and
sdc are the WD units, with two drives on each power rail. The machine is
a rack-mount server, and you don't get supplies with weak power rails on
a milspec server. FWIW, I looked closely at the maximum power
utilization for each component and guaranteed that in each case I had an
adequate safety margin on the power budget. My profession requires that
I understand power budgets, startup current, flyback, noisy signals, etc.

It has been running for a month now with no hiccups since disabling the
concurrent SMART tests (a staggered schedule is sketched below).
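For the archives: staggering the tests needs nothing more than spaced
cron entries. A minimal sketch, assuming smartctl from smartmontools and
the by-id names quoted above (the two-hour spacing is arbitrary; a short
test finishes in a couple of minutes, so any spacing that keeps only one
test in flight at a time is enough):

  # crontab: one SMART short test per drive, two hours apart, Sundays
  0 1 * * 0  /usr/sbin/smartctl -t short /dev/disk/by-id/ata-ST2000DL003-9VT166_5YD4TB1X
  0 3 * * 0  /usr/sbin/smartctl -t short /dev/disk/by-id/ata-ST2000DM001-1CH164_Z3408QTZ
  0 5 * * 0  /usr/sbin/smartctl -t short /dev/disk/by-id/ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338
  0 7 * * 0  /usr/sbin/smartctl -t short /dev/disk/by-id/ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V

Note that smartctl -t short merely issues the command to the drive and
returns immediately, so there is nothing to background; the schedule
alone serializes the tests.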
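For reference, the member state described earlier in the thread can be
checked at any time (md0 here is illustrative):

  cat /proc/mdstat           # [UUUU] = all members up; [UU__] = two members dropped
  mdadm --detail /dev/md0    # per-device state: active, faulty, removed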
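The eight-hour reassembly is not spelled out in the thread. For
completeness, the usual recovery path for an array whose members were
kicked out with stale event counts looks something like this (a sketch
only, not necessarily what was done here; device names are illustrative,
and this is only sensible once the cause is understood and the data is
backed up):

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sda /dev/sdb /dev/sdc /dev/sdd
  # the hours-long part is the resync that follows; progress shows in:
  cat /proc/mdstat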