I would suspect power getting overloaded on one of the 12V lines. I have been running long SMART tests on my drives weekly for years and have never seen a crash related to it. You may want to check how the power supply is wired: some of them have separate 12V rails, and you can overload one rail while the other is fine. I have also seen certain cheaper (non-Intel, non-AMD) SATA chipsets lock up and take out all disks attached to that chipset (a M*rvel 6Gbit chipset). SMART commands seemed to make that chipset/PCI board extra unreliable, and I would lose all 4 ports on the card at the same moment, although in that case I believe a reboot was required to recover.

On Sat, Mar 9, 2019 at 3:47 AM Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:
>
> Nothing useful in dmesg, and "offline" means registered as failed
> under /proc/mdstat, i.e. [UU..] instead of [UUUU].
> Since the array was mounted at the time of failure, there were just
> read/write errors when attempting to access files on it.
>
> These are the drive IDs from my script that spawns the SMART tests:
>
> "ata-ST2000DL003-9VT166_5YD4TB1X"
> "ata-ST2000DM001-1CH164_Z3408QTZ"
> "ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338"
> "ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V"
>
> WD and Seagate 2TB drives, nothing special about them.
>
> As far as low-priority IO goes, if all four disks are running the test
> concurrently and the RAID driver is trying to do IO, then it seems to
> me that a RAID operation timeout becomes a possibility. If I
> understand correctly, the SMART test runs in the drive firmware and is
> started by a SCSI/ATA command sent to the drive. Once it's started,
> the host system is out of the loop, right?
>
>
>
> On 3/9/19, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
> > On Sat, 9 Mar 2019 01:09:18 -0500
> > Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:
> >
> >> On two occasions I have experienced two of the four slices in a
> >> four-disk array going offline in the wee hours of the morning while
> >> doing heavy batch processing and concurrently running SMART short
> >> tests on the drives via crontab.
> >>
> >> My crontab consecutively starts the short test on all drives, and
> >> since it backgrounds them, they all run concurrently.
> >>
> >> I suspect that the Linux implementation of software RAID is not able
> >> to handle the heavy disk IO while all the drives are also doing their
> >> SMART test.
> >>
> >> The first time this happened the first two stripes went offline, and
> >> the second time the other two went offline. Luckily, on both occasions
> >> I was able to reassemble the array (an eight-hour process) and did not
> >> lose any data.
> >
> > It will be difficult for anyone to guess what happened without a
> > corresponding complete dmesg output to define what "crashes" or "went
> > offline" meant exactly, and also what the drive models are. For
> > instance, SMR-based drives are known to have a hard time if you try to
> > put a ton of concurrent load on them.
> >
> > A SMART test by itself shouldn't cause issues, as it's low-priority IO
> > for the drive, so in theory the host shouldn't even be aware that it is
> > going on.
> >
> > --
> > With respect,
> > Roman
> >
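
PS: for reference, the kind of cron-driven script Kent describes would presumably look something like the sketch below. This is a guess on my part, not his actual script; the smartctl invocation and the /dev/disk/by-id paths built from the IDs he listed are assumptions.

#!/bin/sh
# Rough sketch only: kick off a SMART short self-test on every array member,
# backgrounded so they all start at roughly the same time.  The by-id paths
# are built from the IDs listed above; the real script may differ.

DRIVES="
ata-ST2000DL003-9VT166_5YD4TB1X
ata-ST2000DM001-1CH164_Z3408QTZ
ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338
ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V
"

for id in $DRIVES; do
    # smartctl returns as soon as the drive accepts the command; the test
    # itself then runs entirely inside the drive firmware.
    smartctl -t short "/dev/disk/by-id/$id" &
done
wait

Note that backgrounding with & is mostly cosmetic here: smartctl -t short only issues the command and exits, and the self-test then runs inside the drive, which matches Kent's point that the host is out of the loop once the test has started.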