> On Sat, Nov 15, 2008 at 08:58:46AM +1100, Neil Brown wrote:
>> On Friday November 14, greg@xxxxxxxxxxxx wrote:
>> > Hi Neil, hope the week is ending well for you and the rest of the
>> > denizens on the linux-raid list.
>> >
>> > Somewhat of a Gedanken question for you.
>> >
>> > We currently attempt a re-write on read error for volumes which have
>> > redundancy, i.e. RAID[156] etc., on the bet that we can force a bad
>> > sector remap. Should we be attempting that (or do we) on a write
>> > error as well?
>>
>> I don't think so.
>> By the time md/raid gets an error status, lower levels (whether driver
>> or firmware) should have retried as much as is appropriate. Doing
>> further retries at the md level should be pointless.
>>
>> For reads, we do retry. But the purpose is to find out exactly which
>> block failed so that we can just re-write that block. There is no
>> expectation that a block which previously failed a read will now
>> succeed.
>>
>> Similarly there is no reason to expect that a block which previously
>> failed a write will now succeed.
>>
>> I suggest that you might like to discuss your particular case with the
>> author of the driver for the device. Maybe the driver should be
>> retrying. Maybe the firmware is doing the wrong thing.
>>
>> After all, you wouldn't expect every different filesystem to retry all
>> failed writes, would you?
>>
>>
>> >
>> > BTW much thanks for the existing re-write code. Countless mornings
>> > I have said 'gee that Neil Brown was clever' when I see that one of
>> > our machines cleaned up a potential problem before it became a bigger
>> > one.
>> :-)
>> To be honest, that code was largely because people kept complaining
>> about read errors being too fatal and wanted something done. The only
>> way to stop the flood of complaints was to fix something :-)
>>
>> >
>> > Best wishes for a pleasant weekend.
>>
>> And for you!
>>
>> NeilBrown

<<Moved from the top post to a bottom post>>

On Fri, Nov 14, 2008 at 7:47 PM, Keld Jørn Simonsen <keld@xxxxxxxx> wrote:
> I would like to write something about this for the wiki.
> What exactly is done, and is it general for all of linux md raid?
>
> best regards
> keld
>

If you are going to document this in a wiki, please document when a
write error can occur, because I don't understand how this one
occurred.

I thought they could only occur:

1) With bad media on the platter, when the reallocatable sectors
   section was already 100% utilized

2) Due to a CRC error on the comm path (flaky cable, power, etc.)

As I read the errors below, neither of those occurred.

And as Neil said, I believe the retries related to CRC errors should
be handled below the MD level.

Greg
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
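
For anyone writing this up for the wiki, the behaviour Neil describes
can be sketched as a toy model of a two-way mirror. This is not the md
code: Disk, Mirror and their methods are hypothetical names made up for
illustration, and the idea that a successful write clears a bad sector
is an assumption standing in for a firmware remap. The sketch only
shows the two cases from the thread: a read error is repaired by
re-writing that one block from the redundant copy, and a write error
fails the member with no retry at this level.

```python
class Disk:
    """Toy block device (hypothetical model, not a kernel interface)."""

    def __init__(self, name, blocks):
        self.name = name
        self.blocks = list(blocks)
        self.unreadable = set()   # blocks that fail on read until rewritten
        self.unwritable = set()   # blocks that fail on write (no spare left)
        self.faulty = False

    def read(self, n):
        if n in self.unreadable:
            raise IOError(f"{self.name}: read error at block {n}")
        return self.blocks[n]

    def write(self, n, data):
        if n in self.unwritable:
            raise IOError(f"{self.name}: write error at block {n}")
        # Assumption: a successful write lets the drive remap the sector.
        self.unreadable.discard(n)
        self.blocks[n] = data


class Mirror:
    """Toy two-or-more-way mirror over Disk objects."""

    def __init__(self, disks):
        self.disks = disks

    def read(self, n):
        members = [d for d in self.disks if not d.faulty]
        if not members:
            raise IOError("array has no working members")
        first = members[0]
        try:
            return first.read(n)
        except IOError:
            return self._repair(first, n)

    def _repair(self, bad, n):
        # Fetch the block from another member, then re-write only that block.
        for other in self.disks:
            if other is bad or other.faulty:
                continue
            try:
                data = other.read(n)
            except IOError:
                continue
            try:
                bad.write(n, data)
                bad.read(n)          # re-read to check that the repair held
            except IOError:
                bad.faulty = True    # repair failed: give up on this member
            return data
        raise IOError(f"block {n} unreadable on every member")

    def write(self, n, data):
        for disk in self.disks:
            if disk.faulty:
                continue
            try:
                disk.write(n, data)
            except IOError:
                disk.faulty = True   # write error: no retry at this level


if __name__ == "__main__":
    a = Disk("sda", ["x", "y"])
    b = Disk("sdb", ["x", "y"])
    md = Mirror([a, b])
    a.unreadable.add(1)
    print(md.read(1), a.faulty)   # block repaired from sdb, sda stays in
    b.unwritable.add(0)
    md.write(0, "z")
    print(b.faulty)               # sdb is failed after the write error
```

In this model the retry on read exists only to locate and re-write the
failing block; nothing is retried on write, which matches the reasoning
that the driver or firmware has already retried as much as is
appropriate.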