Re: impact of one slow drive?

-----Original Message-----

From:  "Miles Fidelman" <mfidelman@xxxxxxxxxxxxxxxx>
Subj:  impact of one slow drive?
Date:  Sun Jun 27, 2010 8:31 pm
Size:  2K
To:  "linux-raid@xxxxxxxxxxxxxxx" <linux-raid@xxxxxxxxxxxxxxx>

Hi Folks, 
 
I've been slowly tracking down a problem on one of my servers - after  
prolonged periods of high disk activity, iowait goes to 100% and things  
slow down to a crawl. 
 
The server has 4 drives, with various software RAID (md) built on top of  
them.  After a LOT of digging, I've come to the conclusion that one of  
the drives gets flaky under prolonged stress. 
 
I'll find out for sure, when I plug in a replacement (in the mail) - but  
meanwhile.... 
 
What seems weird about this is that the drive passes all the usual 
tests (roughly the commands sketched after this list): 
- the SMART long test comes back fine, way within tolerances 
- hdparm -tT comes back with values comparable to other drives 
- badblocks reports nothing wrong (but seems to take a LONG time to  
complete) 
- the only diagnostic that indicates anything odd is atop - under  
sustained load, the one drive reports a higher percentage of busy time,  
and seems to have a longer avio 
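 
For reference, these are roughly the commands behind those checks 
(/dev/sdX is just a placeholder for each drive in turn): 
 
  smartctl -t long /dev/sdX     # start the SMART extended self-test 
  smartctl -a /dev/sdX          # read the attributes/results once it finishes 
  hdparm -tT /dev/sdX           # cached and buffered read timing 
  badblocks -sv /dev/sdX        # read-only surface scan 
  atop 5                        # per-disk busy% and avio at 5-second intervals 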
 
I recall from previous experience with a failing drive that the drive 
also tested fine - but it seemed to have very high access delays 
compared to the other drives (I can't remember what tool I used to 
measure it) -- which led me to surmise that something between the disk 
platter and the higher-level software was exhibiting one of two 
failure modes: 
a) very high delays, or, 
b) multiple retries that ultimately came back with a proper response 
Either way, the performance of one drive dragged down the entire system. 
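 
One way to see those per-drive delays - possibly the tool I was 
thinking of - is iostat's extended statistics, which report an average 
wait per I/O for each drive; the 5-second interval below is arbitrary: 
 
  iostat -dx 5      # compare the await column (avg ms per request) across drives 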
 
Which leads to a few questions: 
 
- what tools are there for detailed diagnostics (timing, failure rates,  
etc.) - I can't seem to find anything other than hdparm and smartctl -  
which aren't showing anything 
 
- shouldn't something in the hardware/software chain detect long 
delays, mark a disk as failing, and take it out of a RAID array? 
 
- are there some metrics that md tracks that might be relevant, and if  
so, how does one get at them? 
 
- are there some relevant tuning parameters for md? 
 
Thanks very much, 
 
Miles Fidelman 
 
 
--  
In theory, there is no difference between theory and practice. 
In practice, there is.   .... Yogi Berra 
 
Based on the symptoms, your disk is frequently dropping into a deep 
recovery cycle, i.e. it tries for 5-20+ seconds to recover the data 
from a hard-to-read sector.
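 
Whether you can bound that recovery time depends on the firmware: if 
the drive supports SCT ERC and your smartctl build exposes it, 
something like the following may cap it (many desktop drives ignore or 
won't persist the setting, so treat this as a maybe): 
 
  smartctl -l scterc /dev/sdX         # query current read/write recovery limits 
  smartctl -l scterc,70,70 /dev/sdX   # cap both at 7.0 seconds (units of 100 ms) 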

The real fix, once you've replaced this drive, is to use enterprise 
drives, which bound their error-recovery time and so don't have this 
problem to begin with. A stopgap is to run full media reads often, to 
force remapping of hard-to-read blocks.  
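 
With md, the simplest way to get those regular full-media reads is a 
periodic array scrub; a raw read of an individual member does much the 
same thing (md0 and /dev/sdX below are just placeholders): 
 
  echo check > /sys/block/md0/md/sync_action   # read-scrub the whole array 
  dd if=/dev/sdX of=/dev/null bs=1M            # or force a full read of one member 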

