Re: smart short test crashes software raid array?

    You might consider employing an external drive enclosure, or if space outside the server is tight, using 2.5" adapters and swapping your drives for six or eight 2.5" drives, or maybe even SSDs.  You might even increase your storage while you are at it.  One can get 4T, 2.5" hard drives for just a little over $100.  For around $1000, you could increase your storage from 6T with one parity to 16T with two parities and two hot spares.

On 3/9/2019 10:43 AM, Kent Dorfman wrote:
On 3/9/19, Leslie Rhorer <lrhorer@xxxxxxxxxxxx> wrote:
      Oh, and BTW, you say you are losing two spindles, right? From your
response, I loosely infer it is either the two Seagate drives or else
the two Western Digital drives that are being dropped.  Is that
correct?  If so, doesn't that suggest something to you?

      On a different note, you don't mention what RAID level you are
running, but given the fact the array croaks when two drives are
removed, I take it the array is RAID5 or less?  I suggest at least RAID6
and I would also suggest a hot spare.
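
Just as an illustration (the device names below are placeholders, adjust
them to your own layout), a six-member RAID6 with one hot spare would be
created under mdadm roughly like this:

    mdadm --create /dev/md0 --level=6 --raid-devices=6 --spare-devices=1 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh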

On 3/9/2019 8:41 AM, Leslie Rhorer wrote:
     I think you may have used the term "driver" twice when you meant
to say "drive".  The SMART operations are all contained fully within
the hard drive hardware, so there is no direct interaction between
RAID or any other host software and the SMART facilities, including
the tests.  In a properly functioning hard drive, SMART operations
should not cause an I/O call timeout.  I believe there are timeout
traps in the ATA / SCSI I/O drivers, but I don't think there are any
timeout traps in RAID.  I could be wrong on that point, however.
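
For what it's worth, the kernel-side command timer (and the drive's own
error recovery limit, which interacts with it) can be inspected roughly
like this (sda is just an example device):

    # Drive-side error recovery limit (SCT ERC), in tenths of a second
    smartctl -l scterc /dev/sda
    # Kernel-side SCSI command timer for the same device, in seconds
    cat /sys/block/sda/device/timeout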

     RAID itself is not going to be susceptible to failure due to bandwidth
limitations of the CPU or the I/O subsystems.  One could run RAID on a
bunch of floppy disks, and it wouldn't care.  I am guessing you have a
hardware issue.  Since the issue reportedly only shows up when SMART
tests are running, I would suspect the hard drives themselves.

     Roger's suggestions are also reasonable.  I would start by backing
up all the data.  I will say that again: back up all your data.  Then
I would try swapping hardware, starting with the drives, and then the
power supply.  How big is the power supply, by the way?  Then I would
try drive cables, followed by the HBA / Drive controller.  Are you
using an external drive enclosure, or are the drives internal to the PC?

On 3/9/2019 3:43 AM, Kent Dorfman wrote:
nothing useful in the dmesg, and offline means registered as failed
under /proc/mdstat, i.e. [UU..] instead of [UUUU].
since the array was mounted at the time of failure, just read/write
errors when attempting to access files on it.
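
for reference, the state at failure time can be captured with something
along these lines (md0 stands in for the actual array device):

    cat /proc/mdstat                  # [UU..] means two members dropped
    mdadm --detail /dev/md0           # per-device state: active, faulty, removed
    dmesg | grep -iE 'md|ata|sd'      # kernel messages around the failure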

These are the drive IDs from my script that spawns smart test.

"ata-ST2000DL003-9VT166_5YD4TB1X"
"ata-ST2000DM001-1CH164_Z3408QTZ"
"ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338"
"ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V"

WD and seagate 2TB drives.  nothing special about them.

As far as low priority IO goes, if all four disks are doing the test
concurrently and the raid driver is trying to do IO, then it seems to
me that a raid operation timeout becomes a possibility.  if I
understand correctly, then the smart test is in the driver firmware
and is a scsi/ata command function sent to the drive.  Once it's
started then the host system is out of the loop, right?
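
i.e. something along these lines, where the command returns immediately
and the drive runs the test internally, with the result read back later
from the drive's own self-test log:

    smartctl -t short /dev/disk/by-id/ata-ST2000DL003-9VT166_5YD4TB1X
    # ...later, poll the drive's self-test log for the outcome
    smartctl -l selftest /dev/disk/by-id/ata-ST2000DL003-9VT166_5YD4TB1X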



On 3/9/19, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
On Sat, 9 Mar 2019 01:09:18 -0500
Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:

On two occasions I have experienced two of four slices in a four disk
array going offline in the wee hours of the morning while doing heavy
batch processing and concurrently running smart short tests on the
drives via crontab.

My crontab consecutively starts the short test on all drives, and
since it backgrounds them, they all are running concurrently.

I suspect that the linux implementation of software raid is not able
to handle the heavy disk IO while all the drives are also doing their
smart test.

The first time this happened the first two stripes went offline, and
the second time the other two went offline.  Luckily on both occasions
I was able to reassemble the array (eight hour process) and did not
lose any data.
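
For reference, reassembly after such a dropout typically looks something
like this; md0 and the member names are just placeholders:

    mdadm --stop /dev/md0
    mdadm --assemble --force /dev/md0 /dev/sd[abcd]1
    cat /proc/mdstat       # watch the resync/recovery progress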
It will be difficult for anyone to guess what happened without a
corresponding complete dmesg output to define what "crashes" or "went
offline" meant exactly.  And also what are the drive models.  For
instance, SMR-based drives are known to have a hard time if you try to
put a ton of concurrent load on them.

SMART test by itself shouldn't cause issues, as it's a low priority IO
for the drive, so in theory the host shouldn't even be aware that it is
going on.

--
With respect,
Roman


internal raid5, no room for a spare...drives are staggered. sda and
sdc are WD.  two drives on each power rail...the machine is a rack
mount server.  you don't get supplies with weak power rails on a
milspec server, FWIW.  I looked closely at the max power utilization
for each component and guaranteed that in each case I had an adequate
safety margin on the power budget.  My profession requires that I
understand power budgets, startup current, flyback, noisy signals,
etc.

Been running for a month now with no hiccups since disabling the
concurrent smart tests.
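
If the tests ever need to come back, one option would be to serialize
them rather than launching all four at once, something along these lines
(the sleep interval is arbitrary; a short test normally finishes within
a few minutes):

    #!/bin/sh
    # run the short self-tests one drive at a time instead of concurrently
    for d in \
        ata-ST2000DL003-9VT166_5YD4TB1X \
        ata-ST2000DM001-1CH164_Z3408QTZ \
        ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338 \
        ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V
    do
        smartctl -t short "/dev/disk/by-id/$d"
        sleep 900
    done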



