On Tue, Jan 26, 2010 at 4:19 PM, Ryan Wagoner <rswagoner@xxxxxxxxx> wrote:
> On Fri, Jan 22, 2010 at 11:32 AM, Goswin von Brederlow
> <goswin-v-b@xxxxxx> wrote:
>> Tim Bock <jtbock@xxxxxxxxxxxx> writes:
>>
>>> Hello,
>>>
>>> I built a raid-1 + lvm setup on a Dell 2950 in December 2008. The OS
>>> disk (ubuntu server 8.04) is not part of the raid. Raid is 4 disks + 1
>>> hot spare (all raid disks are sata, 1TB Seagates).
>>>
>>> Worked like a charm for ten months, and then had some kind of disk
>>> problem in October which drove the load average to 13. Initially tried
>>> a reboot, but system would not come all of the way back up. Had to boot
>>> single-user and comment out the RAID entry. System came up, I manually
>>> failed/removed the offending disk, added the RAID entry back to fstab,
>>> rebooted, and things proceeded as I would expect. Replaced offending
>>> drive.
>>
>> If a drive goes crazy without actually dying, Linux can spend a
>> long time trying to get something from the drive. The controller chip can
>> go crazy, or the driver itself can have a bug and lock up. All those
>> things are below the RAID level, and if they halt your system then RAID
>> cannot do anything about it.
>>
>> Only when a drive goes bad and the lower layers report an error to the
>> RAID level can RAID cope with the situation, remove the drive, and keep
>> running. Unfortunately there seems to be a loose correlation between
>> the cost of the controller (chip) and the likelihood of a failing disk
>> locking up the system. I.e. the cheap onboard SATA chips on desktop
>> systems do that more often than expensive server controllers. But that
>> is just a loose relationship.
>>
>> MfG
>> Goswin
>>
>> PS: I've seen hardware raid boxes lock up too, so this isn't a drawback
>> of software raid.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
> You need to be using drives designed for RAID use with TLER (time
> limited error recovery). When the drive encounters an error, instead of
> attempting to read the data for an extended period of time it just
> gives up so the RAID can take care of it.
>
> For example, I had a SAS drive start to fail on a hardware RAID server.
> Every time it hit a bad spot on the drive you could tell: the system
> would pause for a brief second as only that drive's light was on. The
> drive gave up and the RAID determined the correct data. It ran fine
> like this until I was able to replace the drive the next day.
>
> Ryan

Why doesn't the kernel issue a pessimistic alternate 'read' path (on the
other drives needed to obtain the data) if the ideal method is late? For
time-sensitive or worst-case buffering, it would be more useful to be
able to customize dynamically when to 'give up'.
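The scheme the question describes — start the normal read, and if it has not completed within a deadline, speculatively issue a redundant read (from the mirror, or reconstructed from parity) and take whichever answer arrives first — can be sketched in user space. This is only an illustrative model, not how md actually handles reads; `read_primary`, `read_fallback`, and the toy "slow drive" below are hypothetical stand-ins for the per-device read paths:

```python
import concurrent.futures
import time

def read_with_fallback(read_primary, read_fallback, timeout_s=0.5):
    """Issue the primary read; if it is late, also fire the
    pessimistic alternate read and return whichever finishes first.
    timeout_s plays the role of a dynamically tunable 'give up' knob."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(read_primary)
        try:
            # Fast path: the primary read completes within the deadline.
            return primary.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            # Primary is late: launch the alternate read in parallel
            # and take the first result that comes back.
            fallback = pool.submit(read_fallback)
            done, _ = concurrent.futures.wait(
                [primary, fallback],
                return_when=concurrent.futures.FIRST_COMPLETED)
            return done.pop().result()

# Toy demonstration: the "slow drive" is stuck in internal error
# recovery, so the mirror answers first.
def slow_drive():
    time.sleep(1.0)            # simulates a drive retrying a bad sector
    return b"from slow drive"

def mirror():
    return b"from mirror"

print(read_with_fallback(slow_drive, mirror, timeout_s=0.2))
```

The catch, as noted earlier in the thread, is that the stuck request still occupies the drive, the controller, and possibly the bus: a user-level timeout cannot cancel work below the RAID layer, which is why drives that give up quickly on their own (TLER) help so much.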