Re: Read errors on raid5 ignored, array still clean .. then disaster !!

Giovanni Tessore wrote:

RAID-5 unfortunately is inherently insecure, here is why:
If one drive gets kicked, MD starts recovering to a spare.
At that point any single read error during the regeneration (which, like a scrub, reads every remaining sector) will fail the array.
This is a problem that cannot be overcome even in theory.
Yes, I was just coming to the same conclusion :-(
Suppose you have 2 TB mainstream disks with an unrecoverable read error rate of 1 sector per 1E+14 bits read (= 1.25E+13 bytes). That means you can expect an error roughly once every 6.25 full reads of the disk! So if a disk fails, you have roughly a 1-in-6 chance of failing the array during reconstruction, and that only accounts for reading a single surviving disk; a rebuild has to read every remaining member in full.
Simply unacceptable!

I looked at the specs of some enterprise disks, and their quoted rate is 1 sector per 1E+15 or 1E+16 bits. Better, but still risky.
Ok.. I'll definitely move to raid-6.
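
Just to put numbers on it, here is my own back-of-envelope sketch (only a rough estimate, assuming read errors are independent and using the 2 TB disk size and the quoted rates above):

import math

DISK_BYTES = 2e12   # 2 TB disk, as in the example above

# Quoted unrecoverable read error (URE) rates: 1 bad sector per N bits read.
for bits_per_error in (1e14, 1e15, 1e16):
    expected_ures = DISK_BYTES * 8 / bits_per_error
    # Poisson approximation: chance of at least one URE while reading the
    # whole disk once (a rebuild reads every surviving member, so the
    # exponent scales with the number of remaining disks).
    p_fail = 1 - math.exp(-expected_ures)
    print(f"1 per {bits_per_error:.0e} bits: "
          f"{p_fail:.1%} chance of a URE per full-disk read")

For the 1E+14 figure that comes out around 15%, in the same ballpark as the 1-in-6 estimate above, and with several surviving members to read during a rebuild the odds only get worse.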

Also raid-1 with fewer than 3 disks is exposed the same way :-(
Same goes for raid-10 ... wow

Well, these two threads on read errors turned out to be rather instructive ... doh!!

Regards


The manufacturer error numbers don't mean much; typically good disks won't fail a rebuild that often. I have done a lot of rebuilds, and read errors during a rebuild are fairly rare, especially if you are doing proper patrol reads/scrubs of the raid arrays (see the sketch below). If you have disks sitting for long periods of time without read scans, then all bets are off and you will have issues.
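
For reference, here is a minimal sketch of kicking off such a patrol read by hand on Linux md, assuming the array is /dev/md0 and the script runs as root (many distros already ship a periodic checkarray cron job that does essentially the same thing, so this is only illustrative):

import time

MD = "/sys/block/md0/md"   # assumed array; adjust for your md device

def read_attr(attr):
    with open(f"{MD}/{attr}") as f:
        return f.read().strip()

def main():
    # Ask the md layer to read every sector and verify parity/mirrors.
    with open(f"{MD}/sync_action", "w") as f:
        f.write("check\n")

    # Wait for the scrub to finish (sync_action returns to "idle").
    while read_attr("sync_action") != "idle":
        time.sleep(60)

    print("scrub finished, mismatch_cnt =", read_attr("mismatch_cnt"))

if __name__ == "__main__":
    main()

A non-zero mismatch_cnt after the check is usually worth a closer look before a disk actually dies.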

I have never seen a properly working disk actually expose an error rate that high to the OS, and that is with several years of history on more than 5000 disks.

I have seen a few manufacturer "lots" of disks that had seriously high error rates (certain sizes and certain manufacture date ranges). In one set of disks (2000 desktop drives and 600 "enterprise" drives from the same company), the desktop drives were almost perfect (fewer than 10 replaced after 2 years), while out of the 600 "enterprise" disks we had replaced about 50 for read errors by the time we finally got the manufacturer to send an engineer on-site to validate what was happening and RMA the entire lot (all 550 disks that had not yet been replaced).

Nothing in the quoted error rate indicated that behavior: if you get a bad lot it will be very bad, and if you don't, you very likely won't have issues. Averaging the bad lots into the overall error rate may well produce a number that high, but your luck will depend on whether you got a good or a bad lot.
