Re: smart short test crashes software raid array?

Nothing useful in the dmesg. "Offline" means registered as failed
under /proc/mdstat, i.e. [UU..] instead of [UUUU].
Since the array was mounted at the time of failure, there were just
read/write errors when attempting to access files on it.
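A degraded array like the one described shows up as an '_' in the [UUUU] status line. A minimal sketch of spotting that from a script (the helper name and the canned mdstat lines are illustrative, not from the post; in real use you would read /proc/mdstat):

```shell
#!/bin/sh
# Sketch (hypothetical helper): detect a degraded md array from the
# [UUUU]-style status line, where '_' marks a failed slot, e.g. [UU__]
# after two members drop out. Canned sample lines stand in for
# /proc/mdstat so the sketch runs anywhere.
degraded() {
    grep -q '\[U*_[U_]*\]' "$1"
}

printf '      3907023872 blocks [4/2] [UU__]\n' > /tmp/mdstat.degraded
printf '      3907023872 blocks [4/4] [UUUU]\n' > /tmp/mdstat.healthy

degraded /tmp/mdstat.degraded && echo "degraded: yes"
degraded /tmp/mdstat.healthy  || echo "healthy: ok"
```

On a real host, `degraded /proc/mdstat` in the same crontab could alert before the morning discovery.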

These are the drive IDs from my script that spawns the SMART tests.

"ata-ST2000DL003-9VT166_5YD4TB1X"
"ata-ST2000DM001-1CH164_Z3408QTZ"
"ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338"
"ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V"
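For reference, a hedged reconstruction of what such a crontab script might look like (the by-id names are from the list above; the DRY_RUN switch is my addition so the sketch runs without hardware):

```shell
#!/bin/sh
# Hypothetical sketch of a crontab script that starts a SMART short test
# on each drive by its stable /dev/disk/by-id name. DRY_RUN=1 only prints
# the commands; set DRY_RUN=0 on the real host to actually queue them.
DRY_RUN=${DRY_RUN:-1}

for id in \
    ata-ST2000DL003-9VT166_5YD4TB1X \
    ata-ST2000DM001-1CH164_Z3408QTZ \
    ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338 \
    ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V
do
    dev=/dev/disk/by-id/$id
    if [ "$DRY_RUN" = 1 ]; then
        echo "smartctl -t short $dev"
    else
        # smartctl returns as soon as the drive accepts the command, so
        # four consecutive invocations leave all four tests running at once.
        smartctl -t short "$dev"
    fi
done
```

Because `smartctl -t short` returns immediately, no explicit backgrounding is even needed for the tests to overlap.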

WD and Seagate 2TB drives.  Nothing special about them.

As far as low-priority IO goes, if all four disks are running the test
concurrently while the raid driver is trying to do IO, then it seems
to me that a raid operation timeout becomes a possibility.  If I
understand correctly, the SMART test is implemented in the drive
firmware and is started by a scsi/ata command sent to the drive.  Once
it's started, the host system is out of the loop, right?
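That model can be checked from the host side: after the command is queued, all the host can do is poll the drive's self-test execution status via `smartctl -c`. A small sketch, with a canned status line approximating real `smartctl -c` output so it runs without a drive:

```shell
#!/bin/sh
# Sketch: once `smartctl -t short` queues the test, the firmware runs it
# and the host can only poll status. The helper greps the execution
# status; the sample line approximates real `smartctl -c` output.
selftest_running() {
    grep -q 'Self-test routine in progress'
}

# In real use: smartctl -c /dev/disk/by-id/ata-... | selftest_running
sample='Self-test execution status:      ( 249) Self-test routine in progress...'
if printf '%s\n' "$sample" | selftest_running; then
    echo "still running"
fi
```

One way to avoid four concurrent tests would be to poll like this in a loop (with a `sleep`) and only start the next drive's test once the previous drive reports idle.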



On 3/9/19, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
> On Sat, 9 Mar 2019 01:09:18 -0500
> Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:
>
>> On two occasions I have experienced two of four slices in a four disk
>> array going offline in the wee hours of the morning while doing heavy
>> batch processing and concurrently running smart short tests on the
>> drives via crontab.
>>
>> My crontab consecutively starts the short test on all drives, and
>> since it backgrounds them, they all are running concurrently.
>>
>> I suspect that the linux implementation of software raid is not able
>> to handle the heavy disk IO while all the drives are also doing their
>> smart test.
>>
>> The first time this happened the first two stripes went offline, and
>> the second time the other two went offline.  Luckily on both occasions
>> I was able to reassemble the array (eight hour process) and did not
>> lose any data.
>
> It will be difficult for anyone to guess what happened without a
> corresponding complete dmesg output to define what "crashes" or "went
> offline" meant exactly. And also what the drive models are. For
> instance, SMR-based drives are known to have a hard time if you try to
> put a ton of concurrent load on them.
>
> A SMART test by itself shouldn't cause issues, as it's low-priority IO
> for the drive, so in theory the host shouldn't even be aware that it
> is going on.
>
> --
> With respect,
> Roman
>