Re: impact of one slow drive?

David Lethe <david@xxxxxxxxxxxx> · Sun, 27 Jun 2010 23:11:00 -0500

-----Original Message-----

From:  "Miles Fidelman" <mfidelman@xxxxxxxxxxxxxxxx>
Subj:  impact of one slow drive?
Date:  Sun Jun 27, 2010 8:31 pm
Size:  2K
To:  "linux-raid@xxxxxxxxxxxxxxx" <linux-raid@xxxxxxxxxxxxxxx>

Hi Folks, 

I've been slowly tracking down a problem on one of my servers - after  
prolonged periods of high disk activity, iowait goes to 100% and things  
slow down to a crawl. 

The server has 4 drives, with various software RAID (md) built on top of  
them.  After a LOT of digging, I've come to the conclusion that one of  
the drives gets flaky under prolonged stress. 

I'll find out for sure, when I plug in a replacement (in the mail) - but  
meanwhile.... 

What seems to be weird about this is that the drive passes all kinds of  
normal tests: 
- the SMART long test comes back fine, way within tolerances 
- hdparm -tT comes back with values comparable to other drives 
- badblocks reports nothing wrong (but seems to take a LONG time to  
complete) 
- the only diagnostic that indicates anything odd is atop - under  
sustained load, the one drive reports a higher percentage of busy time,  
and seems to have a longer avio 

I recall from previous experience, with a failing drive, that the drive  
also tested fine - but seemed to have very high access delays as  
compared to other drives (I can't remember what tool I used to measure  
it) -- which led me to surmise that something, between the disk platter,  
and higher level software, was exhibiting one of two failure modes: 
a) very high delay, or, 
b) required multiple retries, but ultimately came back with a proper  
response 
Either way, the performance of one drive dragged down the entire system. 

Which leads to a few questions: 

- what tools are there for detailed diagnostics (timing, failure rates,  
etc.) - I can't seem to find anything other than hdparm and smartctl -  
which aren't showing anything 

- shouldn't something in the hardware/software chain detect long delays,  
mark a disk as failing, and take it out of a RAID array? 

- are there some metrics that md tracks that might be relevant, and if  
so, how does one get at them? 

- are there some relevant tuning parameters for md? 

Thanks very much, 

Miles Fidelman 

--  
In theory, there is no difference between theory and practice. 
In<fnord>  practice, there is.   .... Yogi Berra 

Based on the symptoms, your disk is frequently in a deep recovery cycle, i.e. it tries for 5-20+ seconds to recover data before it answers.
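
One way to see this from the host side is to compare the average time per completed request across the four drives while the box is under load. What follows is only a rough sketch, not a polished tool: it assumes the standard /proc/diskstats layout (read and write counts plus the milliseconds spent on each), and the function names are made up for illustration. A drive stuck in recovery should show an average wait far above its siblings.

```python
import time

def parse_diskstats(text):
    """Map device name -> (completed I/Os, ms spent on reads + writes)."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip short or malformed lines
        # f[2]=name, f[3]=reads completed, f[6]=ms reading,
        # f[7]=writes completed, f[10]=ms writing
        stats[f[2]] = (int(f[3]) + int(f[7]), int(f[6]) + int(f[10]))
    return stats

def average_wait_ms(before, after):
    """Average ms per completed I/O between two snapshots, per device."""
    result = {}
    for name, (ios_after, ms_after) in after.items():
        if name not in before:
            continue
        ios_before, ms_before = before[name]
        completed = ios_after - ios_before
        if completed > 0:
            result[name] = (ms_after - ms_before) / completed
    return result

def sample(interval_seconds=10.0, path="/proc/diskstats"):
    """Take two snapshots interval_seconds apart and compare them."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval_seconds)
    with open(path) as fh:
        after = parse_diskstats(fh.read())
    return average_wait_ms(before, after)
```

Calling sample() during a heavy I/O period returns a dict keyed by device name. Partitions and md devices appear in it too, so compare the whole-disk entries against each other.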

The real fix, once you have replaced this drive, is to get enterprise drives, which don't have this problem to begin with. As a stopgap, run full media reads often to force remapping of hard-to-read blocks.
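
If you want the full media read to double as a diagnostic, something along these lines would do it. Treat it as a sketch under assumptions: the function name, chunk size and one-second threshold are arbitrary choices of mine, and it reads through the page cache, so a second pass over recently read data may look faster than the disk really is. It reads the device from start to end and records every chunk that took longer than the threshold.

```python
import os
import time

def scan_device(path, chunk_size=1024 * 1024, slow_seconds=1.0):
    """Read path from start to end.

    Returns (bytes_read, slow_chunks), where slow_chunks is a list of
    (byte_offset, seconds) for reads that took at least slow_seconds.
    An unrecoverable read error surfaces as OSError from os.read().
    """
    slow_chunks = []
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            start = time.monotonic()
            data = os.read(fd, chunk_size)
            elapsed = time.monotonic() - start
            if not data:
                break  # end of device
            if elapsed >= slow_seconds:
                slow_chunks.append((offset, elapsed))
            offset += len(data)
    finally:
        os.close(fd)
    return offset, slow_chunks
```

Run it as root against the whole-disk device node of the suspect drive (for example a hypothetical /dev/sdX). Offsets that keep showing up as slow on repeated runs are the hard-to-read areas.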