But is this not a good opportunity to repair the bad stripe at very low
cost (no complete resync required)? At the time of the error we actually
know which disk failed and can re-write it, something we do not know at
resync time, which is why I assume a repair always writes to the parity
disk. (The check/repair cycle Justin describes is sketched at the end of
this message.)

Justin Piszcz wrote:
> Should the raid have noticed the error, checked the offending
> stripe and taken appropriate action? The messages from that error
> are below.
>
> I don't think so; that is why we need to run a check every once in
> a while and check the mismatch_cnt file for each md raid device.
>
> Run repair, then re-run check to verify the count goes back to 0.
>
> Justin.
>
> On Sat, 24 Feb 2007, Eyal Lebedinsky wrote:
>
>> I run a 'check' weekly, and yesterday it came up with a non-zero
>> mismatch count (184). There were no earlier RAID errors logged
>> and the count was zero after the run a week ago.
>>
>> Now, the interesting part is that there was one i/o error logged
>> during the check *last week*; however, the raid did not see it and
>> the count was zero at the end. No errors were logged during the
>> week since, or during the check last night.
>>
>> fsck (ext3 with logging) found no errors, but I may have bad data
>> somewhere.
>>
>> Should the raid have noticed the error, checked the offending
>> stripe and taken appropriate action? The messages from that error
>> are below.
>>
>> Naturally, I do not know if the mismatch is related to the failure
>> last week; it could be due to a number of other causes (bad
>> memory? kernel bug?).
>>
>> System details:
>> 2.6.20 vanilla
>> /dev/sd[ab]: on motherboard
>>   IDE interface: Intel Corp. 82801EB (ICH5) Serial ATA 150
>>   Storage Controller (rev 02)
>> /dev/sd[cdef]: Promise SATA-II-150-TX4
>>   Unknown mass storage controller: Promise Technology, Inc.:
>>   Unknown device 3d18 (rev 02)
>> All 6 disks are WD 320GB SATA drives of similar models
>>
>> Tail of dmesg, showing all messages since last week's 'check':
>>
>> *** last week check start:
>> [927080.617744] md: data-check of RAID array md0
>> [927080.630783] md: minimum _guaranteed_ speed: 24000 KB/sec/disk.
>> [927080.648734] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
>> [927080.678103] md: using 128k window, over a total of 312568576 blocks.
>> *** last week error:
>> [937567.332751] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4190002 action 0x2
>> [937567.354094] ata3.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in
>> [937567.354096]          res 51/04:83:45:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
>> [937568.120783] ata3: soft resetting port
>> [937568.282450] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>> [937568.306693] ata3.00: configured for UDMA/100
>> [937568.319733] ata3: EH complete
>> [937568.361223] SCSI device sdc: 625142448 512-byte hdwr sectors (320073 MB)
>> [937568.397207] sdc: Write Protect is off
>> [937568.408620] sdc: Mode Sense: 00 3a 00 00
>> [937568.453522] SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>> *** last week check end:
>> [941696.843935] md: md0: data-check done.
>> [941697.246454] RAID5 conf printout:
>> [941697.256366] --- rd:6 wd:6
>> [941697.264718] disk 0, o:1, dev:sda1
>> [941697.275146] disk 1, o:1, dev:sdb1
>> [941697.285575] disk 2, o:1, dev:sdc1
>> [941697.296003] disk 3, o:1, dev:sdd1
>> [941697.306432] disk 4, o:1, dev:sde1
>> [941697.316862] disk 5, o:1, dev:sdf1
>> *** this week check start:
>> [1530647.746383] md: data-check of RAID array md0
>> [1530647.759677] md: minimum _guaranteed_ speed: 24000 KB/sec/disk.
>> [1530647.778041] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
>> [1530647.807663] md: using 128k window, over a total of 312568576 blocks.
>> *** this week check end:
>> [1545248.680745] md: md0: data-check done.
>> [1545249.266727] RAID5 conf printout:
>> [1545249.276930] --- rd:6 wd:6
>> [1545249.285542] disk 0, o:1, dev:sda1
>> [1545249.296228] disk 1, o:1, dev:sdb1
>> [1545249.306923] disk 2, o:1, dev:sdc1
>> [1545249.317613] disk 3, o:1, dev:sdd1
>> [1545249.328292] disk 4, o:1, dev:sde1
>> [1545249.338981] disk 5, o:1, dev:sdf1

--
Eyal Lebedinsky (eyal@xxxxxxxxxxxxxx) <http://samba.org/eyal/>
	attach .zip as .dat
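P.S. For reference, the weekly cycle Justin describes (check, inspect
mismatch_cnt, repair, re-run check) can be driven from the md sysfs
interface. Below is a minimal sketch, assuming a single array at
/sys/block/md0/md/ and root privileges; the device name, polling
interval, and lack of error handling are illustrative, not a tested
tool:

#!/usr/bin/env python3
# Minimal sketch of the check/repair cycle discussed above.
# Assumes the md sysfs interface (/sys/block/<dev>/md/) and root
# privileges; device name and polling interval are illustrative.
import time

DEV = "md0"  # hypothetical: adjust for each md raid device

def attr(name):
    return "/sys/block/%s/md/%s" % (DEV, name)

def run_action(action):
    # Start a 'check' or 'repair' pass, then poll sync_action until
    # the array returns to 'idle'.
    with open(attr("sync_action"), "w") as f:
        f.write(action)
    while open(attr("sync_action")).read().strip() != "idle":
        time.sleep(60)

def mismatch_cnt():
    return int(open(attr("mismatch_cnt")).read().strip())

run_action("check")
count = mismatch_cnt()
if count:
    print("%s: mismatch_cnt = %d, running repair" % (DEV, count))
    run_action("repair")
    run_action("check")  # re-run check; the count should return to 0
    print("%s: mismatch_cnt now %d" % (DEV, mismatch_cnt()))
else:
    print("%s: no mismatches" % DEV)

The same can of course be done from the shell with
'echo check > /sys/block/md0/md/sync_action' and a
'cat /sys/block/md0/md/mismatch_cnt' once the array is idle again.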