On Saturday November 16, jbass@dmsd.com wrote:
> On first error the system currently appears to just abandon a drive, forcing
> the system into degraded mode for all I/O which follows. A much more reasonable
> approach would be to not abandon the drive completely, but rather build a fast
> lookup table of known bad blocks, which would allow accesses to most areas of
> the array to continue without degradation, and only areas that have bad blocks
> would be forced into degraded mode.
>
> Many drives will trash a sector if power drops when writing, and that sector
> will generate read errors until written. It makes sense on those drives to
> recover the data in degraded mode, then rewrite it and verify. If the
> verify fails, and the drive supports dynamic sparing/remapping, the sector
> should be remapped, rewritten, and verified again. On a large 200GB array, this
> single feature would remove nearly a day of reconstruction time for normal
> errors and sector failures, substantially improving realized reliability.
>
> Doing dynamic error management would remove 99% of the gross software RAID
> device failures I have seen over the last year.

You are largely correct... I look forward to you providing (or
sponsoring) code to do this. :-)

Maybe this should go on a FAQ as it does get mentioned from time to time.
The answer is:
   Yes, it could be done.
   No, it hasn't been done.
   Patches are always welcome.

NeilBrown
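
For anyone tempted to pick this up, the scheme proposed in the quoted message could be sketched roughly as below. This is only a toy model in Python, not md driver code: SimDrive, Mirror, ReadError, and every method name are hypothetical, a two-way mirror stands in for whatever redundancy the array has, and the "unwritable" set is an assumed stand-in for a sector that needs drive-level remapping.

```python
class ReadError(Exception):
    """Raised by the toy drive when a sector cannot be read."""


class SimDrive:
    """Hypothetical in-memory drive used only for illustration."""

    def __init__(self, blocks, trashed=(), unwritable=()):
        self.blocks = list(blocks)
        self.trashed = set(trashed)        # sectors that fail reads until rewritten
        self.unwritable = set(unwritable)  # sectors a plain rewrite cannot fix
        self.remapped = set()              # sectors moved to a spare

    def read(self, n):
        if n in self.trashed:
            raise ReadError(n)
        return self.blocks[n]

    def write(self, n, data):
        self.blocks[n] = data
        if n not in self.unwritable:
            self.trashed.discard(n)

    def remap(self, n):
        # Stand-in for dynamic sparing: after this a rewrite will succeed.
        self.unwritable.discard(n)
        self.remapped.add(n)


class Mirror:
    """Toy mirror that keeps a per-drive bad-block table instead of
    dropping a whole drive on the first read error."""

    def __init__(self, drives):
        self.drives = drives
        self.bad = [set() for _ in drives]

    def read(self, n, skip=None):
        for i, drive in enumerate(self.drives):
            if i == skip or n in self.bad[i]:
                continue  # degraded for this block only
            try:
                return drive.read(n)
            except ReadError:
                self.bad[i].add(n)  # remember the block, keep the drive
        raise ReadError(n)

    @staticmethod
    def _verify(drive, n, data):
        try:
            return drive.read(n) == data
        except ReadError:
            return False

    def repair(self, i, n):
        """Recover block n from the other drives, rewrite, verify,
        and remap plus rewrite once more if the first verify fails."""
        data = self.read(n, skip=i)
        drive = self.drives[i]
        drive.write(n, data)
        if not self._verify(drive, n, data):
            drive.remap(n)
            drive.write(n, data)
            if not self._verify(drive, n, data):
                return False  # at this point the drive really is suspect
        self.bad[i].discard(n)
        return True


d0 = SimDrive([b"a", b"b", b"c"], trashed={1, 2}, unwritable={2})
d1 = SimDrive([b"a", b"b", b"c"])
m = Mirror([d0, d1])
m.read(1)        # served from d1; block 1 is recorded as bad on d0 only
m.read(2)
m.repair(0, 1)   # fixed by a plain rewrite
m.repair(0, 2)   # needs the remap step
```

A real implementation would also have to persist the table across reboots and keep the lookup cheap on the I/O path, neither of which this sketch attempts.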