I would suspect power getting overloaded on one of the 12V lines. I have been running long SMART tests on my drives weekly for years and have never seen a crash related to it. You may want to check how the power supply is wired: some of them have separate 12V rails, and you can overload one rail while the other is fine. I have also seen certain cheaper (non-Intel, non-AMD) SATA chipsets lock up and take out all disks attached to that chipset (a M*rvel 6Gbit chipset). SMART commands seemed to make that chipset/PCI board extra unreliable, and I would lose all 4 ports on the card at the same moment, although in that case I believe a reboot was required to recover.

On Sat, Mar 9, 2019 at 3:47 AM Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:
>
> Nothing useful in dmesg, and "offline" means registered as failed
> under /proc/mdstat, i.e. [UU..] instead of [UUUU].
> Since the array was mounted at the time of failure, there were just
> read/write errors when attempting to access files on it.
>
> These are the drive IDs from my script that spawns the SMART tests:
>
> "ata-ST2000DL003-9VT166_5YD4TB1X"
> "ata-ST2000DM001-1CH164_Z3408QTZ"
> "ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338"
> "ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V"
>
> WD and Seagate 2TB drives, nothing special about them.
>
> As far as low-priority IO goes, if all four disks are running the test
> concurrently and the RAID driver is trying to do IO, then it seems to
> me that a RAID operation timeout becomes a possibility. If I
> understand correctly, the SMART test runs in the drive firmware and is
> started by a SCSI/ATA command sent to the drive. Once it's started,
> the host system is out of the loop, right?
>
>
>
> On 3/9/19, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
> > On Sat, 9 Mar 2019 01:09:18 -0500
> > Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:
> >
> >> On two occasions I have experienced two of the four slices in a
> >> four-disk array going offline in the wee hours of the morning while
> >> doing heavy batch processing and concurrently running SMART short
> >> tests on the drives via crontab.
> >>
> >> My crontab consecutively starts the short test on all drives, and
> >> since it backgrounds them, they all run concurrently.
> >>
> >> I suspect that the Linux implementation of software RAID is not able
> >> to handle the heavy disk IO while all the drives are also doing their
> >> SMART test.
> >>
> >> The first time this happened the first two stripes went offline, and
> >> the second time the other two went offline. Luckily, on both occasions
> >> I was able to reassemble the array (an eight-hour process) and did not
> >> lose any data.
> >
> > It will be difficult for anyone to guess what happened without a
> > corresponding complete dmesg output to define what "crashes" or "went
> > offline" meant exactly, and also what the drive models are. For
> > instance, SMR-based drives are known to have a hard time if you try to
> > put a ton of concurrent load on them.
> >
> > A SMART test by itself shouldn't cause issues, as it's low-priority IO
> > for the drive, so in theory the host shouldn't even be aware that it is
> > going on.
> >
> > --
> > With respect,
> > Roman
> >
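
PS: for reference, the kind of cron-driven script Kent describes would presumably look something like the sketch below. This is a guess on my part, not his actual script; the smartctl invocation and the /dev/disk/by-id paths built from the IDs he listed are assumptions.

#!/bin/sh
# Rough sketch only: kick off a SMART short self-test on every array member,
# backgrounded so they all start at roughly the same time.  The by-id paths
# are built from the IDs listed above; the real script may differ.

DRIVES="
ata-ST2000DL003-9VT166_5YD4TB1X
ata-ST2000DM001-1CH164_Z3408QTZ
ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338
ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V
"

for id in $DRIVES; do
    # smartctl returns as soon as the drive accepts the command; the test
    # itself then runs entirely inside the drive firmware.
    smartctl -t short "/dev/disk/by-id/$id" &
done
wait

Note that backgrounding with & is mostly cosmetic here: smartctl -t short only issues the command and exits, and the self-test then runs inside the drive, which matches Kent's point that the host is out of the loop once the test has started.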