Re: smart short test crashes software raid array?

Roman Mamedov <rm@xxxxxxxxxxx> · Sat, 9 Mar 2019 12:49:45 +0500

On Sat, 9 Mar 2019 01:09:18 -0500
Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:

> On two occasions I have experienced two of four slices in a four disk
> array going offline in the wee hours of the morning while doing heavy
> batch processing and concurrently running smart short tests on the
> drives via crontab.
> 
> My crontab consecutively starts the short test on all drives, and
> since it backgrounds them, they all are running concurrently.
> 
> I suspect that the linux implementation of software raid is not able
> to handle the heavy disk IO while all the drives are also doing their
> smart test.
> 
> The first time this happened the first two stripes went offline, and
> the second time the other two went offline.  Luckily on both occasions
> I was able to reassemble the array (eight hour process) and did not
> lose any data.

It will be difficult for anyone to guess what happened without the complete
dmesg output from the time of the failure, which would show what "crashes" or
"went offline" actually meant. The drive models matter as well. For instance,
SMR-based drives are known to have a hard time when you put a lot of
concurrent load on them.
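
If it helps, here is a rough sketch of one way to gather those details into a
single report to attach to a reply. It is only an illustration: the device
names in DRIVES are hypothetical placeholders, it assumes smartctl (from
smartmontools) is installed, and it most likely needs to run as root for the
smartctl and dmesg calls to return anything useful.

```python
#!/usr/bin/env python3
"""Collect kernel log, md status, and drive identity into one text report."""
import shutil
import subprocess

# Hypothetical device names; substitute the real members of the array.
DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]


def build_commands(drives):
    """Return the commands whose output is worth attaching to a report."""
    commands = [
        ["dmesg"],                # kernel log, including md and ATA/SCSI errors
        ["cat", "/proc/mdstat"],  # current state of the md arrays
    ]
    for drive in drives:
        commands.append(["smartctl", "-i", drive])              # model, firmware
        commands.append(["smartctl", "-l", "selftest", drive])  # self-test log
    return commands


def run(command):
    """Run one command and return its output, or a note if it is missing."""
    if shutil.which(command[0]) is None:
        return f"(skipped: {command[0]} not found)\n"
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout + result.stderr


def report(drives):
    """Run every command and join the outputs under a '$ command' header."""
    sections = []
    for command in build_commands(drives):
        sections.append("$ " + " ".join(command) + "\n" + run(command))
    return "\n".join(sections)


if __name__ == "__main__":
    print(report(DRIVES))
```

Redirecting the output to a file right after the next failure, before
reassembling the array, should capture the kernel messages while they are
still in the log buffer.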

A SMART test by itself shouldn't cause issues, as it is low-priority IO for
the drive, so in theory the host shouldn't even be aware that it is going on.

-- 
With respect,
Roman