The MD subsystem now does a very good job of trying to recover a bad block on a disk: it rewrites the block's content (to force the drive to reallocate the block in question) and verifies that the write succeeded. But I wonder if it's worth the effort to go further than that.

md can now use bitmaps. And a bitmap can be used not only to mark "clean vs dirty" chunks, but also, for example, "good vs bad" chunks.

That is to say: suppose we discover a read error on one component of an array, we try to rewrite the block, and the rewrite (or the reread) fails. The current implementation will kick the bad drive from the array. But it would be possible not to kick it, and instead to set the corresponding bit(s) in the bitmap, meaning "the data at this location on this drive is wrong, don't try to read from it", and to continue using the drive as before (modulo the parts just marked).

The rationale is that each time we kick a whole drive from an array, for whatever reason, we greatly reduce the chances of the whole array staying in working condition. For some reason, drives from the same batch tend to develop bad blocks close together in time: we see a bad block on one drive, and pretty soon we see another bad block on another drive (at least in our experience). So by kicking one drive, we increase the failure probability even more.

We had a large batch of Seagate 36G SCSI drives which all had some firmware issue: each time a drive detected a bad sector and we tried to mitigate it (by rewriting it), the drive reported "defect list manipulation error" (I don't remember the exact sense code), and only on the second attempt did it rewrite the sector in question successfully. Seagate refused to acknowledge the problem no matter how we argued; they said it was "mishandling" (as in, we had handled the drives improperly). That is just one example of the numerous cases where kicking the whole drive is not a good idea.

Even more:
If we see a *read* error, there's no need to mark the chunk as "bad" in the bitmap; that is needed only if we see a *write* error while writing some *new* data. I.e., the "bad" bit in the bitmap can mean "data at this place is out of sync", a sort of "extended dirty". Interpreted like that, there's no need to allocate a new bit: the existing "dirty" bit can be used. On resync, we try to write again, and just keep the "dirty" bit set if the write fails. Obviously, we should not try to read from those "dirty" places. And if there are no components left to read from, just return a read error for that single read, but keep the array running (maybe in read-only mode, whatever).

It seems pretty simple to implement with the existing code. The only requirement is to have a bitmap; obviously, without a bitmap the whole idea does not work.

This fits the "policy does not belong in the kernel" model perfectly as well. Never, ever, try to do something "large" (like kicking off the whole disk), but let userspace decide what to do... Mdadm event handlers (scripts called when something goes wrong) can kick the disk off just fine.

Comments, anyone?

/mjt
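
P.S. To make the proposed semantics concrete, here is a rough sketch as a toy Python model. It is not md code and not how the kernel is structured. The names (Component, Mirror, fail_reads, fail_writes) are made up for illustration, and it assumes two things the text above does not spell out: that the "dirty" state is tracked per component (one bit per chunk per drive), and that a write always covers a whole chunk.

```python
class Component:
    """Toy stand-in for one member disk (hypothetical, for illustration only)."""

    def __init__(self, nchunks):
        self.data = [None] * nchunks
        self.fail_reads = set()    # chunks whose reads currently fail
        self.fail_writes = set()   # chunks whose writes currently fail

    def read(self, chunk):
        if chunk in self.fail_reads:
            raise OSError("read error")
        return self.data[chunk]

    def write(self, chunk, value):
        if chunk in self.fail_writes:
            raise OSError("write error")
        self.data[chunk] = value


class Mirror:
    """Toy mirror that marks chunks dirty instead of kicking a drive."""

    def __init__(self, ncomponents, nchunks):
        self.components = [Component(nchunks) for _ in range(ncomponents)]
        # Assumption: one "dirty" set per component, i.e. a per-drive bitmap.
        self.dirty = [set() for _ in range(ncomponents)]

    def write(self, chunk, value):
        failed = []
        for i, comp in enumerate(self.components):
            try:
                comp.write(chunk, value)
                # A successful whole-chunk write brings this copy in sync.
                self.dirty[i].discard(chunk)
            except OSError:
                failed.append(i)
        if len(failed) == len(self.components):
            # Nothing was written anywhere; fail the write, mark nothing.
            raise OSError("write failed on every component")
        for i in failed:
            # Write error on *new* data: this copy is now out of sync.
            self.dirty[i].add(chunk)

    def read(self, chunk):
        for i, comp in enumerate(self.components):
            if chunk in self.dirty[i]:
                continue           # never read from a "dirty" place
            try:
                return comp.read(chunk)
            except OSError:
                continue           # a read error alone sets no bit
        # Only this one read fails; the array keeps running.
        raise OSError("no in-sync component left for this chunk")

    def resync(self):
        for i, comp in enumerate(self.components):
            for chunk in sorted(self.dirty[i]):
                try:
                    # self.read() skips component i, since it is dirty here.
                    comp.write(chunk, self.read(chunk))
                except OSError:
                    continue       # still failing: keep the dirty bit
                self.dirty[i].discard(chunk)
```

In this model a failed write on one drive leaves only that chunk marked on that drive, reads of the chunk are served from the other copy, and resync() clears the mark once the write finally succeeds. Deciding when a drive has accumulated too many such marks and should be kicked is left to whatever calls this, in the same spirit as leaving that decision to userspace.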