SMART short test crashes software RAID array?

Kent Dorfman <kent.dorfman766@xxxxxxxxx> · Sat, 9 Mar 2019 01:09:18 -0500

On two occasions I have seen two of the four member devices in a
four-disk array go offline in the wee hours of the morning, while the
machine was doing heavy batch processing and concurrently running
SMART short tests on the drives via crontab.

My crontab starts the short test on each drive one after another, and
since each test is backgrounded, they all end up running concurrently.

I suspect that the Linux implementation of software RAID is not able
to handle the heavy disk I/O while all the drives are also running
their SMART tests.
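
One way to test this suspicion would be to stagger the short tests so
that only one drive is testing at a time. Below is a minimal sketch in
Python, not what my crontab currently does. It assumes smartmontools
is installed, that "smartctl -t short" is how the tests are started,
and that the device names and the five-minute pause are placeholders
to be adjusted for the actual drives.

```python
import subprocess
import time

# Hypothetical device list: substitute the real array members.
DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]


def short_test_command(device):
    """Build the smartctl command that starts a short self-test."""
    return ["smartctl", "-t", "short", device]


def run_staggered(drives, wait_seconds=300,
                  run=subprocess.run, sleep=time.sleep):
    """Start a short test on each drive in turn, pausing in between.

    wait_seconds is an assumed pause meant to outlast one short test;
    check the drive's own estimated duration before relying on it.
    run and sleep are injectable so the loop can be dry-run.
    """
    started = []
    for index, device in enumerate(drives):
        # check=False: a failure on one drive does not stop the rest.
        run(short_test_command(device), check=False)
        started.append(device)
        # No need to wait after the last drive.
        if index < len(drives) - 1:
            sleep(wait_seconds)
    return started
```

Calling run_staggered(DRIVES) as root from a single cron entry would
replace the four separate entries. If the array stays up under the
same batch load with the tests staggered, that would support the
theory; if members still drop, the cause is probably elsewhere.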

The first time this happened the first two members went offline, and
the second time the other two did.  Luckily on both occasions I was
able to reassemble the array (an eight-hour process) and did not lose
any data.
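
To pin down when the members drop relative to the SMART tests, the
array state could be polled from /proc/mdstat. The following is a
rough sketch in Python that assumes the usual mdstat layout, where
each array has a status string such as [UUUU] and an underscore marks
a missing member. The function name is made up for illustration.

```python
import re


def degraded_arrays(mdstat_text):
    """Map array name to status string for arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header lines look like "md0 : active raid5 ...".
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status string holds one U (up) or _ (missing) per member.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


def read_mdstat(path="/proc/mdstat"):
    """Read the kernel's md status file."""
    with open(path) as handle:
        return handle.read()
```

Running degraded_arrays(read_mdstat()) from cron every minute and
logging any non-empty result with a timestamp would show whether the
failures line up with the start of the tests.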

This email is mostly an FYI, but I am also looking for confirmation
that what I experienced was probably caused by the combination of
events outlined above.

Thoughts?

Linux 4.20.x kernel series on an SMP Xeon E5440 server
Supermicro X7DAE motherboard, 32GB of ECC RAM