Re: raid failure question

- A bad block on a surviving disk during a rebuild causes partial, unrecoverable data loss.
- Firmware bugs can be devastating, such as NCQ/TCQ problems.
- Human error.

-----Original Message-----

From:  "Bill Davidsen" <davidsen@xxxxxxx>
Subj:  Re: raid failure question
Date:  Mon Feb 1, 2010 2:21 pm
Size:  1K
To:  "linux-raid@xxxxxxxxxxxxxxx" <linux-raid@xxxxxxxxxxxxxxx>

Robin Hill wrote: 
> On Mon Jan 11, 2010 at 11:00:40AM -0700, Tim Bock wrote: 
> 
>    
>> Hello, 
>> 
>> Excluding the obvious multi-disk or bus failures, can anyone describe 
>> what type of disk failure a raid cannot detect/recover from? 
>> 
>> I have had two disk failures over the last three months, and in spite of 
>> having a hot spare, manual intervention was required each time to make 
>> the raid usable again.  I'm just not sure if I'm not setting something 
>> up right, or if there is some other issue. 
>> 
>> Thanks for any comments or suggestions. 
>> 
>>      
> Any failure where the disk doesn't actually return an error (within a 
> reasonable time).  For example, consumer grade disks often have very 
> long retry times - this can mean the array is unusable for a long time 
> until the disk eventually fails the read. 
> 
> If the disk actually returns an error then, AFAIK, the RAID array should 
> always be able to recover from it. 
>    
 
The problem is that the admin should be able to set a timeout after 
which recovery takes place even if the drive hasn't returned a bad 
status. A counter could also be kept so that after a number of these 
timeouts the drive is failed. There is no solution yet: Neil says the 
timeout should be in the driver, while the driver writers say that if it 
hurts md, the timeout should be in md. Everyone points the finger at some 
other code and says "there." 
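 
As a rough illustration of the timeout-plus-counter idea described above, here is a minimal sketch in Python. It is not md or driver code. The class name `TimeoutPolicy`, the default values, and the returned action strings are all hypothetical, chosen only to show the decision logic.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TimeoutPolicy:
    """Hypothetical policy: recover on a slow read, fail the drive after N of them."""
    timeout_s: float = 7.0   # assumed admin-set limit for a single read, in seconds
    max_timeouts: int = 3    # assumed number of slow reads tolerated per drive
    timeouts: Dict[str, int] = field(default_factory=dict)

    def record_read(self, drive: str, elapsed_s: float) -> str:
        """Return the action to take for one read that took `elapsed_s` seconds."""
        if elapsed_s <= self.timeout_s:
            return "ok"
        # The drive did not answer in time: count it, even without a bad status.
        count = self.timeouts.get(drive, 0) + 1
        self.timeouts[drive] = count
        if count >= self.max_timeouts:
            return "fail_drive"
        # Below the limit: serve the data from the other disks instead.
        return "recover_from_redundancy"
```

With `TimeoutPolicy(timeout_s=5.0, max_timeouts=2)`, the first 30-second read on a drive would trigger recovery from the other disks and the second would fail the drive. Where such a timer should live (md or the driver) is exactly the open question in this thread.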
 
This is not laziness or buck-passing; Neil feels that md is not the 
place, but putting it elsewhere causes other problems. Until someone 
says "perfect is the enemy of good enough" and puts a timer where it 
will solve the problem, this behavior will continue. 
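 
For reference, Linux exposes a per-device command timeout for SCSI and libata disks in sysfs, at `/sys/block/<dev>/device/timeout` (in seconds). Whether that is the timer anyone in this thread has in mind is not stated here, so treat the following as a sketch for inspecting the current value only. The function name `read_command_timeout` and its `sysfs_root` parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_command_timeout(dev: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the command timeout in seconds for `dev` (e.g. "sda"), or None.

    None is returned when the attribute is missing or unreadable, for example
    on devices that are not handled by the SCSI layer.
    """
    path = Path(sysfs_root) / "block" / dev / "device" / "timeout"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

The `sysfs_root` argument exists only so the function can be pointed at a test directory; on a real system the default is what you would use.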
 
--  
Bill Davidsen <davidsen@xxxxxxx> 
  "We can't solve today's problems by using the same thinking we 
   used in creating them." - Einstein 
 
-- 
To unsubscribe from this list: send the line "unsubscribe linux-raid" in 
the body of a message to majordomo@xxxxxxxxxxxxxxx 
More majordomo info at  http://vger.kernel.org/majordomo-info.html 
