Re: nonzero mismatch_cnt with no earlier error

Eyal Lebedinsky <eyal@xxxxxxxxxxxxxx> · Mon, 26 Feb 2007 19:18:45 +1100

I CC'ed linux-ide to see if they think the reported error was really innocent:

Question: does this error report suggest that a disk could be corrupted?

This SATA disk is part of an md raid and no error was reported by md.

[937567.332751] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4190002 action 0x2
[937567.354094] ata3.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
[937567.354096]          res 51/04:83:45:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[937568.120783] ata3: soft resetting port
[937568.282450] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[937568.306693] ata3.00: configured for UDMA/100
[937568.319733] ata3: EH complete
[937568.361223] SCSI device sdc: 625142448 512-byte hdwr sectors (320073 MB)
[937568.397207] sdc: Write Protect is off
[937568.408620] sdc: Mode Sense: 00 3a 00 00
[937568.453522] SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Neil Brown wrote:
> On Saturday February 24, eyal@xxxxxxxxxxxxxx wrote:
> 
>>But is this not a good opportunity to repair the bad stripe for a very
>>low cost (no complete resync required)?
> 
> 
> In this case, 'md' knew nothing about an error.  The SCSI layer
> detected something and thought it had fixed it itself.  Nothing for md
> to do.

I expected this. So either the SCSI layer incorrectly held back the error
report, or the mismatch_cnt is due to something unrelated to the disk
I/O failure.

>>At time of error we actually know which disk failed and can re-write
>>it, something we do not know at resync time, so I assume we always
>>write to the parity disk.

Again, as I expected: resync cannot tell which block is wrong, so it
makes the stripe consistent by effectively "blaming" the parity block.
To know which block to correct, one needs a higher-level parity code
(can raid6 correct single bit/disk read errors?).
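
To illustrate why single parity can only detect a mismatch and not locate
it, here is a minimal sketch. It assumes plain XOR parity over a toy
three-block stripe; the helper names (xor_blocks, syndrome, flip) are made
up for this example and are not md code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def syndrome(data_blocks, parity):
    """All zero when the stripe is consistent, nonzero on a mismatch."""
    return xor_blocks(list(data_blocks) + [parity])

def flip(block, index, mask):
    """Return a copy of block with one byte XORed by mask (simulated corruption)."""
    out = bytearray(block)
    out[index] ^= mask
    return bytes(out)

# Hypothetical stripe: three data blocks plus their XOR parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

clean = syndrome(data, parity)
# Same corruption applied once to a data block, once to the parity block.
bad_data = syndrome([flip(data[0], 1, 0x20), data[1], data[2]], parity)
bad_parity = syndrome(data, flip(parity, 1, 0x20))

print(clean.hex(), bad_data.hex(), bad_parity.hex())
```

Both corrupted cases give the same nonzero syndrome, so the check sees
that something is wrong but has no way to tell which block to rewrite.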

> md only knows of a 'problem' if the lower level driver reports one.
> If it reports a problem for a write request, md will fail the device.
> If it reports a problem for a read request, md will try to over-write
> correct data on the failed block. 
> But if the driver doesn't report the failure, there is nothing md can
> do.
> 
> When performing a check/repair md looks for inconsistencies and fixes
> them 'arbitrarily'.  For raid5/6, it just 'corrects' the parity.  For
> raid1/10, it chooses one block and over-writes the other(s) with it.
> 
> Mapping these corrections back to blocks in files in the filesystem is
> extremely non-trivial.
> 
> NeilBrown
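
As I understand the description above, the two repair policies could be
sketched roughly as follows. This is only an illustration: the function
names are invented, the stripe is reduced to a few bytes, and picking the
first mirror for raid1 is my assumption (Neil only says md chooses one
block).

```python
def xor_parity(blocks):
    """Compute XOR parity over equal-length byte blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def repair_raid5(data_blocks, parity):
    """Recompute parity from the data as found; data is never touched.

    Returns (new_parity, mismatch_found).
    """
    expected = xor_parity(data_blocks)
    return expected, expected != parity

def repair_raid1(mirrors):
    """Overwrite every mirror with one chosen copy (here: the first, by assumption).

    Returns (new_mirrors, mismatch_found).
    """
    chosen = mirrors[0]
    mismatch = any(m != chosen for m in mirrors[1:])
    return [chosen] * len(mirrors), mismatch

# Parity was written for the original data, then one data block went bad silently.
original = [b"good", b"data"]
on_disk = [b"good", b"BAD!"]
old_parity = xor_parity(original)

new_parity, raid5_mismatch = repair_raid5(on_disk, old_parity)
new_mirrors, raid1_mismatch = repair_raid1([b"stale", b"fresh"])
```

After the raid5 repair the stripe is consistent again, but the bad data
block is still there and the parity now matches it. After the raid1 repair
all mirrors agree, whether or not the chosen copy was the correct one.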

-- 
Eyal Lebedinsky (eyal@xxxxxxxxxxxxxx) <http://samba.org/eyal/>
	attach .zip as .dat