I CC'ed linux-ide to see if they think the reported error was really
innocent.

Question: does this error report suggest that a disk could be corrupted?

This SATA disk is part of an md raid and no error was reported by md.

[937567.332751] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4190002 action 0x2
[937567.354094] ata3.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
[937567.354096]          res 51/04:83:45:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[937568.120783] ata3: soft resetting port
[937568.282450] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[937568.306693] ata3.00: configured for UDMA/100
[937568.319733] ata3: EH complete
[937568.361223] SCSI device sdc: 625142448 512-byte hdwr sectors (320073 MB)
[937568.397207] sdc: Write Protect is off
[937568.408620] sdc: Mode Sense: 00 3a 00 00
[937568.453522] SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Neil Brown wrote:
> On Saturday February 24, eyal@xxxxxxxxxxxxxx wrote:
>
>>But is this not a good opportunity to repair the bad stripe for a very
>>low cost (no complete resync required)?
>
>
> In this case, 'md' knew nothing about an error.  The SCSI layer
> detected something and thought it had fixed it itself.  Nothing for md
> to do.

I expected this. So either the SCSI layer incorrectly held back the
error report, or the mismatch_cnt is due to something unrelated to the
disk I/O failure.

>>At time of error we actually know which disk failed and can re-write
>>it, something we do not know at resync time, so I assume we always
>>write to the parity disk.

Again, as I expected, resync cannot truly correct such a problem: it
cannot tell which block is bad, so it effectively "blames" the parity
block. To know which block to correct one needs a higher level parity
code (can raid6 correct single bit/disk read errors?).

> md only knows of a 'problem' if the lower level driver reports one.
> If it reports a problem for a write request, md will fail the device.
> If it reports a problem for a read request, md will try to over-write
> correct data on the failed block.
> But if the driver doesn't report the failure, there is nothing md can
> do.
>
> When performing a check/repair md looks for inconsistencies and fixes
> them 'arbitrarily'.  For raid5/6, it just 'corrects' the parity.  For
> raid1/10, it chooses one block and over-writes the other(s) with it.
>
> Mapping these corrections back to blocks in files in the filesystem is
> extremely non-trivial.
>
> NeilBrown

-- 
Eyal Lebedinsky (eyal@xxxxxxxxxxxxxx) <http://samba.org/eyal/>
	attach .zip as .dat
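
P.S. To illustrate the point about a raid5 repair "blaming" the parity
block, here is a minimal sketch. It is not md's code: the helper names
(xor_blocks, stripe_is_consistent, repair_parity) and the two-byte
blocks are made up for the example, and it assumes plain XOR parity over
one stripe. It shows that a mismatch can be detected, that the mismatch
looks the same whether a data block or the parity block went bad, and
that rewriting the parity makes the stripe consistent without restoring
the data.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity):
    """A stripe is consistent when the XOR of its data blocks equals the parity."""
    return xor_blocks(data_blocks) == parity


def repair_parity(data_blocks):
    """'Repair' in the raid5 sense described above: recompute parity from the data."""
    return xor_blocks(data_blocks)


# Hypothetical three-data-disk stripe with tiny two-byte blocks.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)

# Case A: data block 1 is silently corrupted (one bit flipped in byte 0).
corrupted = [data[0], b"\x11\x20", data[2]]
mismatch = not stripe_is_consistent(corrupted, parity)
syndrome = xor_blocks(corrupted + [parity])

# Case B: the parity block suffers the same bit flip instead.
parity_bad = bytes([parity[0] ^ 0x01, parity[1]])
syndrome_b = xor_blocks(data + [parity_bad])

# Both cases give the same non-zero syndrome, so XOR parity alone
# cannot say which block is the bad one.
same_symptom = syndrome == syndrome_b

# Rewriting the parity makes case A consistent again, but the
# corrupted data block is left as it is.
repaired_parity = repair_parity(corrupted)
consistent_after = stripe_is_consistent(corrupted, repaired_parity)
data_still_bad = corrupted[1] != data[1]
```

With a single parity block the syndrome only says that something in the
stripe differs, which is why locating the bad block needs either an
error report from the lower layer or a second, independent check.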