On 3/9/19, Leslie Rhorer <lrhorer@xxxxxxxxxxxx> wrote:
> Oh, and BTW, you say you are losing two spindles, right? From your
> response, I loosely infer it is either the two Seagate drives or else
> the two Western Digital drives that are being dropped. Is that
> correct? If so, doesn't that suggest something to you?
>
> On a different note, you don't mention what RAID level you are
> running, but given the fact the array croaks when two drives are
> removed, I take it the array is RAID5 or less? I suggest at least
> RAID6, and I would also suggest a hot spare.
>
> On 3/9/2019 8:41 AM, Leslie Rhorer wrote:
>> I think you may have used the term "driver" twice when you meant
>> to say "drive". The SMART operations are all contained fully within
>> the hard drive hardware, so there is no direct interaction between
>> RAID or any other host software and the SMART facilities, including
>> the tests. In a properly functioning hard drive, SMART operations
>> should not cause an I/O call timeout. I believe there are timeout
>> traps in the ATA / SCSI I/O drivers, but I don't think there are any
>> timeout traps in RAID. I could be wrong on that point, however.
>>
>> RAID itself is not going to be susceptible to failure from bandwidth
>> limitations of the CPU or the I/O subsystems. One could run RAID on a
>> bunch of floppy disks, and it wouldn't care. I am guessing you have a
>> hardware issue. Since the issue reportedly only shows up when SMART
>> tests are running, I would suspect the hard drives themselves.
>>
>> Roger's suggestions are also reasonable. I would start by backing
>> up all the data. I will say that again: back up all your data. Then
>> I would try swapping hardware, starting with the drives, and then the
>> power supply. How big is the power supply, by the way? Then I would
>> try drive cables, followed by the HBA / drive controller. Are you
>> using an external drive enclosure, or are the drives internal to the PC?
>>
>> On 3/9/2019 3:43 AM, Kent Dorfman wrote:
>>> Nothing useful in the dmesg, and "offline" means registered as
>>> failed under /proc/mdstat, i.e. [UU__] instead of [UUUU]. Since the
>>> array was mounted at the time of failure, there were just read/write
>>> errors when attempting to access files on it.
>>>
>>> These are the drive IDs from my script that spawns the SMART tests:
>>>
>>> "ata-ST2000DL003-9VT166_5YD4TB1X"
>>> "ata-ST2000DM001-1CH164_Z3408QTZ"
>>> "ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338"
>>> "ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V"
>>>
>>> WD and Seagate 2TB drives. Nothing special about them.
>>>
>>> As far as low-priority IO goes, if all four disks are doing the test
>>> concurrently and the RAID driver is trying to do IO, then it seems
>>> to me that a RAID operation timeout becomes a possibility. If I
>>> understand correctly, the SMART test runs in the drive firmware and
>>> is a SCSI/ATA command sent to the drive. Once it's started, the host
>>> system is out of the loop, right?
>>>
>>> On 3/9/19, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
>>>> On Sat, 9 Mar 2019 01:09:18 -0500
>>>> Kent Dorfman <kent.dorfman766@xxxxxxxxx> wrote:
>>>>
>>>>> On two occasions I have experienced two of four slices in a
>>>>> four-disk array going offline in the wee hours of the morning
>>>>> while doing heavy batch processing and concurrently running SMART
>>>>> short tests on the drives via crontab.
>>>>>
>>>>> My crontab consecutively starts the short test on all drives, and
>>>>> since it backgrounds them, they all are running concurrently.
>>>>>
>>>>> I suspect that the Linux implementation of software RAID is not
>>>>> able to handle the heavy disk IO while all the drives are also
>>>>> doing their SMART test.
>>>>>
>>>>> The first time this happened, the first two stripes went offline,
>>>>> and the second time the other two went offline. Luckily, on both
>>>>> occasions I was able to reassemble the array (an eight-hour
>>>>> process) and did not lose any data.
>>>>
>>>> It will be difficult for anyone to guess what happened without a
>>>> corresponding complete dmesg output to define what "crashes" or
>>>> "went offline" meant exactly, and also what the drive models are.
>>>> For instance, SMR-based drives are known to have a hard time if you
>>>> try to put a ton of concurrent load on them.
>>>>
>>>> A SMART test by itself shouldn't cause issues, as it's low-priority
>>>> IO for the drive, so in theory the host shouldn't even be aware
>>>> that it is going on.
>>>>
>>>> --
>>>> With respect,
>>>> Roman

Internal RAID5, no room for a spare. The drives are staggered: sda and
sdc are the WD units, with two drives on each power rail. The machine is
a rack-mount server, and you don't get supplies with weak power rails on
a milspec server. FWIW, I looked closely at the maximum power
utilization for each component and guaranteed that in each case I had an
adequate safety margin on the power budget. My profession requires that
I understand power budgets, startup current, flyback, noisy signals, etc.

It has been running for a month now with no hiccups since disabling the
concurrent SMART tests (a staggered schedule is sketched below).
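For the archives: staggering the tests needs nothing more than spaced
cron entries. A minimal sketch, assuming smartctl from smartmontools and
the by-id names quoted above (the two-hour spacing is arbitrary; a short
test finishes in a couple of minutes, so any spacing that keeps only one
test in flight at a time is enough):

  # crontab: one SMART short test per drive, two hours apart, Sundays
  0 1 * * 0  /usr/sbin/smartctl -t short /dev/disk/by-id/ata-ST2000DL003-9VT166_5YD4TB1X
  0 3 * * 0  /usr/sbin/smartctl -t short /dev/disk/by-id/ata-ST2000DM001-1CH164_Z3408QTZ
  0 5 * * 0  /usr/sbin/smartctl -t short /dev/disk/by-id/ata-WDC_WD20EARX-00PASB0_WD-WCAZA8323338
  0 7 * * 0  /usr/sbin/smartctl -t short /dev/disk/by-id/ata-WDC_WD20EZRZ-00Z5HB0_WD-WCC4M4SH8A8V

Note that smartctl -t short merely issues the command to the drive and
returns immediately, so there is nothing to background; the schedule
alone serializes the tests.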
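For reference, the member state described earlier in the thread can be
checked at any time (md0 here is illustrative):

  cat /proc/mdstat           # [UUUU] = all members up; [UU__] = two members dropped
  mdadm --detail /dev/md0    # per-device state: active, faulty, removed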
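The eight-hour reassembly is not spelled out in the thread. For
completeness, the usual recovery path for an array whose members were
kicked out with stale event counts looks something like this (a sketch
only, not necessarily what was done here; device names are illustrative,
and this is only sensible once the cause is understood and the data is
backed up):

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sda /dev/sdb /dev/sdc /dev/sdd
  # the hours-long part is the resync that follows; progress shows in:
  cat /proc/mdstat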