Re: Fw: Why does one get mismatches?

Jon Hardcastle <jd_hardcastle@xxxxxxxxx> · Mon, 25 Jan 2010 02:07:11 -0800 (PST)

--- On Sun, 24/1/10, Goswin von Brederlow <goswin-v-b@xxxxxx> wrote:

> From: Goswin von Brederlow <goswin-v-b@xxxxxx>
> Subject: Re: Fw: Why does one get mismatches?
> To: Jon@xxxxxxxxxxxxxxx
> Cc: "Goswin von Brederlow" <goswin-v-b@xxxxxx>, linux-raid@xxxxxxxxxxxxxxx
> Date: Sunday, 24 January, 2010, 23:13
> Jon Hardcastle <jd_hardcastle@xxxxxxxxx> writes:
> 
> > --- On Fri, 22/1/10, Goswin von Brederlow <goswin-v-b@xxxxxx> wrote:
> >
> >> From: Goswin von Brederlow <goswin-v-b@xxxxxx>
> >> Subject: Re: Fw: Why does one get mismatches?
> >> To: Jon@xxxxxxxxxxxxxxx
> >> Cc: linux-raid@xxxxxxxxxxxxxxx
> >> Date: Friday, 22 January, 2010, 18:13
> >> Jon Hardcastle <jd_hardcastle@xxxxxxxxx> writes:
> >> 
> >> > --- On Tue, 19/1/10, Jon Hardcastle <jd_hardcastle@xxxxxxxxx> wrote:
> >> >
> >> >> From: Jon Hardcastle <jd_hardcastle@xxxxxxxxx>
> >> >> Subject: Why does one get mismatches?
> >> >> To: linux-raid@xxxxxxxxxxxxxxx
> >> >> Date: Tuesday, 19 January, 2010, 10:04
> >> >> Hi,
> >> >> 
> >> >> I kicked off a check/repair cycle on my machine after I moved the physical ordering of my drives around. I am now on my second check/repair cycle and it keeps finding mismatches.
> >> >> 
> >> >> Is it correct that the mismatch count after a repair should equal the count reported by the preceding check? What if it doesn't? What does it mean if another check STILL reveals mismatches?
> >> >> 
> >> >> I had something similar after I reshaped from RAID 5 to RAID 6: I had to run check/repair/check/repair several times before I got my 0.
> >> >> 
> >> >> 
> >> >
> >> > Guys,
> >> >
> >> > Anyone got any suggestions here? I am now on my ~5th check/repair cycle, and after a reboot the first check is still returning 8.
> >> >
> >> > All I have done is move the drives around. It is the same controllers/cables/etc.
> >> >
> >> > I really don't like the seemingly random nature of what can/does/has caused the mismatches.
> >> 
> >> There is some unknown corruption going on with RAID 1 that causes mismatches, but it is believed that it will never occur on any used block. Swapping is a likely cause.
> >> 
> >> Any swap device on the RAID? Try turning that off. If that doesn't help, try unmounting the filesystems or remounting them read-only.
> >> 
> >> MfG
> >>         Goswin
> >
> > Hello, my usual savior Goswin!
> >
> > The deal is that it is a 7-drive RAID 6 array. It has LVM on it and is not used for swap. I have unmounted all LVs and still got mismatches. I ran smartctl --test=long on all drives: nothing. I have now dismantled the array and am 3/4 of the way through 'badblocks -svn' on each of the component drives. I have a hunch that it may be a dodgy SATA cable but have no evidence. No errors in the log, nothing in dmesg.
> >
> > Is there any way to get more information? I am starting to think this has only happened since I changed from RAID 5 to 6, which I did less than a month ago.
> >
> > The only lead I have is that while running badblocks, one drive ran at ~10-15 MB/s whereas the rest are going at ~30 MB/s. I have another drive of the identical model coming up, so I will see if that one is slow too. But the lack of logging info is unhelpful and worrying, and the prospect of silent corruption is a big worry!
> 
> You did run a repair pass and not just repeated check passes, right? Check itself only counts the mismatches but does not correct them. If the RAID is unused (vgchange -a n) and you first run repair and then check, the check definitely should not find any mismatches.
> 
> MfG
>         Goswin

Hello!

Yes, I have a simple script that first does a check and then, if there are mismatches, does a repair. I have then been manually rerunning a check and I keep getting mismatches. It goes like this: 232, 8, 24, 8, 8, 16, 16, 24, 24, 8, 16, 24. I have also done this manually and run several repairs in a row (assuming a repair will report 0 if there is no work to be done).
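A minimal Python sketch of such a check-then-repair cycle follows. It assumes the md sysfs interface under `/sys/block/<array>/md/`. Writing `check` or `repair` to `sync_action` starts a pass, the file reads back `idle` when nothing is running, and `mismatch_cnt` holds the result. The array name `md0`, the polling interval and the function names are hypothetical. Following Goswin's advice, the sketch ends with a second check, which should report 0 on an array with no activity (e.g. after `vgchange -a n`). The kernel documents `mismatch_cnt` as a count of sectors that may exceed the number of actual errors by the sectors-per-page factor. Assuming 4 KiB pages, the counts above (all multiples of 8) would then be 29 pages for 232 and a single page for each 8.

```python
import time
from pathlib import Path
from typing import Callable, List

# Assumption: mismatch_cnt is reported in 512-byte sectors while RAID-5/6
# compares data in 4 KiB pages, so one bad page adds 8 to the count.
SECTORS_PER_PAGE = 8


def run_sync_action(md_dir: Path, action: str,
                    sleep: Callable[[float], None] = time.sleep,
                    interval: float = 30.0) -> int:
    """Start a 'check' or 'repair' pass, wait for 'idle', return mismatch_cnt."""
    (md_dir / "sync_action").write_text(action + "\n")
    while True:
        sleep(interval)  # a full pass can take hours, so poll instead of spinning
        if (md_dir / "sync_action").read_text().strip() == "idle":
            break
    return int((md_dir / "mismatch_cnt").read_text().strip())


def check_repair_check(md_dir: Path, **kw) -> List[int]:
    """Check; if anything mismatched, repair and then check once more."""
    counts = [run_sync_action(md_dir, "check", **kw)]
    if counts[0]:
        counts.append(run_sync_action(md_dir, "repair", **kw))
        counts.append(run_sync_action(md_dir, "check", **kw))
    return counts
```

As a usage example, `check_repair_check(Path("/sys/block/md0/md"))`, run as root, returns the count from each pass. The `sleep` parameter exists only so the loop can be exercised without a real array.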

Now the array is completely dismantled and I am running badblocks on the drives. I am on the last 2 of the 7 drives and I still have no leads: no bad blocks, no offline uncorrectable sectors, no pending sectors, no dmesg errors, nothing. I have absolutely no leads whatsoever.

The only things I have left to try are a full memtest, disconnecting and reseating the additional SATA controllers, and buying 7 new SATA cables in case one is bad.

But it would be REALLY helpful to know on which drive the mismatches occurred.
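On the question of which drive: RAID 6 stores two syndromes per stripe. P is a plain XOR and Q is a weighted sum over GF(2^8). So in principle a single bad data block can be located, because the Q discrepancy equals the P discrepancy multiplied by g^z, where z is the bad block's index. The Python sketch below only illustrates that arithmetic; it is not md's code. It assumes the conventional field polynomial 0x11D with generator g = 2, and data blocks indexed from 0 in stripe order. The function names are hypothetical. With a 7-drive RAID 6, each stripe would hold 5 data blocks plus P and Q. Mapping a block index to a physical drive also depends on the array's layout, which this sketch does not model. It identifies which block disagrees, not why it does.

```python
from typing import List, Optional, Tuple, Union


def _gf_tables() -> Tuple[List[int], List[int]]:
    """Exp/log tables for GF(2^8) with polynomial 0x11D and generator 2."""
    exp, log = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        x <<= 1
        if x & 0x100:
            x ^= 0x11D
    for i in range(255, 512):  # duplicate so exp[a + b] needs no modulo
        exp[i] = exp[i - 255]
    return exp, log


EXP, LOG = _gf_tables()


def gf_mul(a: int, b: int) -> int:
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(data: List[bytes]) -> Tuple[bytes, bytes]:
    """P = XOR of all data blocks; Q = XOR-sum of g^i * D_i."""
    p = bytearray(len(data[0]))
    q = bytearray(len(data[0]))
    for i, block in enumerate(data):
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(EXP[i], byte)
    return bytes(p), bytes(q)


def locate_bad_block(data: List[bytes], p: bytes,
                     q: bytes) -> Optional[Union[int, str]]:
    """None if consistent; a data index, 'P' or 'Q' if one block explains it."""
    p2, q2 = syndromes(data)
    dp = [a ^ b for a, b in zip(p, p2)]
    dq = [a ^ b for a, b in zip(q, q2)]
    if not any(dp) and not any(dq):
        return None
    if not any(dq):
        return "P"  # only the stored P disagrees with the data
    if not any(dp):
        return "Q"  # only the stored Q disagrees with the data
    suspects = set()
    for ep, eq in zip(dp, dq):
        if ep == 0 and eq == 0:
            continue
        if ep == 0 or eq == 0:
            return "unknown"  # cannot be explained by one bad data block
        # For one bad block z: eq = g^z * ep, so z = log(eq) - log(ep) mod 255.
        suspects.add((LOG[eq] - LOG[ep]) % 255)
    if len(suspects) == 1 and next(iter(suspects)) < len(data):
        return suspects.pop()
    return "unknown"
```

The check is per stripe: a result of `"unknown"` means more than one block in that stripe disagrees, which double parity alone cannot resolve.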

Any help here would be gratefully received! I might even try converting the array back to RAID 5, as I remember I had mismatches immediately after I converted from 5 to 6.
