On Sat, 9 Mar 2019 01:09:18 -0500
Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:

> On two occasions I have experienced two of four slices in a four disk
> array going offline in the wee hours of the morning while doing heavy
> batch processing and concurrently running smart short tests on the
> drives via crontab.
>
> My crontab consecutively starts the short test on all drives, and
> since it backgrounds them, they all are running concurrently.
>
> I suspect that the linux implementation of software raid is not able
> to handle the heavy disk IO while all the drives are also doing their
> smart test.
>
> The first time this happened the first two stripes went offline, and
> the second time the other two went offline. Luckily on both occasions
> I was able to reassemble the array (eight hour process) and did not
> lose any data.

It will be difficult for anyone to guess what happened without the
corresponding complete dmesg output showing what exactly "crashes" or
"went offline" meant.

It would also help to know the drive models. For instance, SMR-based
drives are known to have a hard time if you put a ton of concurrent
load on them.

A SMART test by itself shouldn't cause issues, as it is low-priority IO
for the drive, so in theory the host shouldn't even be aware that it is
going on.

--
With respect,
Roman
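P.S. As a rough illustration of the information asked for above, here is a minimal Python sketch that gathers the kernel log, the md status, and the drive identity (model) for each member. The array name /dev/md0 and the drive names used in the usage note are assumptions, not taken from the original report, and the helper names diagnostic_commands and collect are hypothetical. It only wraps standard tools (dmesg, /proc/mdstat, mdadm --detail, smartctl -i) and most likely needs to run as root.

```python
import shutil
import subprocess


def diagnostic_commands(array, drives):
    """Build the commands whose output would help diagnose the dropout.

    `array` and `drives` are device paths supplied by the caller;
    nothing here is specific to the reporter's system.
    """
    cmds = [
        ["dmesg"],                        # kernel log: link resets, timeouts, md events
        ["cat", "/proc/mdstat"],          # current state of all md arrays
        ["mdadm", "--detail", array],     # per-member state of the given array
    ]
    for drive in drives:
        cmds.append(["smartctl", "-i", drive])  # identity info, including the model
    return cmds


def collect(array, drives):
    """Run each command and return one text report suitable for posting."""
    report = []
    for cmd in diagnostic_commands(array, drives):
        line = "$ " + " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            # Tool is missing; note it instead of failing.
            report.append(line + "\n(not installed)\n")
            continue
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        report.append(line + "\n" + result.stdout + result.stderr)
    return "\n".join(report)
```

For example, print(collect("/dev/md0", ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"])) would produce one report to attach to a reply. The dmesg output is most useful if captured before the next reboot, since the messages from the time of the failure are otherwise lost from the kernel ring buffer.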