impact of one slow drive?

Hi Folks,

I've been slowly tracking down a problem on one of my servers - after prolonged periods of high disk activity, iowait goes to 100% and things slow down to a crawl.

The server has 4 drives, with various software RAID (md) built on top of them. After a LOT of digging, I've come to the conclusion that one of the drives gets flaky under prolonged stress.

I'll find out for sure, when I plug in a replacement (in the mail) - but meanwhile....

What seems to be weird about this is that the drive passes all kinds of normal tests:
- the SMART long test comes back fine, way within tolerances
- hdparm -tT comes back with values comparable to other drives
- badblocks reports nothing wrong (but seems to take a LONG time to complete)
- the only diagnostic that indicates anything odd is atop: under sustained load, the one drive reports a higher percentage of busy time, and seems to have a longer avio
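
For anyone wanting to reproduce the atop observation without atop, here is a minimal sketch that derives a per-drive busy percentage and an average time per I/O from two snapshots of /proc/diskstats. It assumes the standard Linux field layout (reads completed in column 4, writes completed in column 8, milliseconds spent doing I/O in column 13) and that atop's "avio" is roughly busy time divided by completed requests; the function names are mine and purely illustrative.

```python
import time


def parse_diskstats(text):
    """Return {device: (ios_completed, ms_doing_io)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpectedly short lines
        name = fields[2]
        ios_completed = int(fields[3]) + int(fields[7])  # reads + writes completed
        ms_doing_io = int(fields[12])                    # time the device was busy
        stats[name] = (ios_completed, ms_doing_io)
    return stats


def compare(before, after, interval_ms):
    """Return {device: (busy_percent, avg_ms_per_io)} over the interval."""
    result = {}
    for name, (ios1, busy1) in after.items():
        if name not in before:
            continue
        ios0, busy0 = before[name]
        d_ios = ios1 - ios0
        d_busy = busy1 - busy0
        busy_pct = 100.0 * d_busy / interval_ms
        avg_ms = d_busy / d_ios if d_ios else 0.0
        result[name] = (busy_pct, avg_ms)
    return result


def sample(interval_s=10.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and compare them."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval_s)
    with open(path) as fh:
        after = parse_diskstats(fh.read())
    return compare(before, after, interval_s * 1000.0)
```

Calling sample() while the array is under sustained load and comparing the four member drives side by side should show whether one of them stands out the way it does in atop.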

I recall from previous experience with a failing drive that it also tested fine, but showed very high access delays compared with the other drives (I can't remember what tool I used to measure it). That led me to surmise that something between the disk platter and the higher-level software was exhibiting one of two failure modes:
a) very high delay, or,
b) multiple retries that ultimately came back with a proper response
Either way, the performance of one drive dragged down the entire system.

Which leads to a few questions:

- what tools are there for detailed diagnostics (timing, failure rates, etc.)? I can't seem to find anything other than hdparm and smartctl, and neither is showing anything

- shouldn't something in the hardware/software chain detect long delays, mark a disk as failing, and take it out of a RAID array?

- are there some metrics that md tracks that might be relevant, and if so, how does one get at them?

- are there some relevant tuning parameters for md?

Thanks very much,

Miles Fidelman


--
In theory, there is no difference between theory and practice.
In<fnord>  practice, there is.   .... Yogi Berra

