On Thu, 7 Apr 2011 03:45:05 -0400 Robin Humble <robin.humble+raid@xxxxxxxxxx> wrote:

> On Tue, Apr 05, 2011 at 03:00:22PM +1000, NeilBrown wrote:
> >On Mon, 4 Apr 2011 09:59:02 -0400 Robin Humble <robin.humble+raid@xxxxxxxxxx>
> >> we are finding non-zero mismatch_cnt's and getting data corruption when
> >> using RHEL5/CentOS5 kernels with md raid6.
> >> actually, all kernels prior to 2.6.32 seem to have the bug.
> >>
> >> the corruption only happens after we replace a failed disk, and the
> >> incorrect data is always on the replacement disk. i.e. the problem is
> >> with rebuild. mismatch_cnt is always a multiple of 8, so I suspect
> >> pages are going astray.
> ...
> >> git bisecting through drivers/md/raid5.c between 2.6.31 (has mismatches)
> >> and .32 (no problems) says that one of these (unbisectable) commits
> >> fixed the issue:
> >>   a9b39a741a7e3b262b9f51fefb68e17b32756999 md/raid6: asynchronous handle_stripe_dirtying6
> >>   5599becca4bee7badf605e41fd5bcde76d51f2a4 md/raid6: asynchronous handle_stripe_fill6
> >>   d82dfee0ad8f240fef1b28e2258891c07da57367 md/raid6: asynchronous handle_parity_check6
> >>   6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 md/raid6: asynchronous handle_stripe6
> >>
> >> any ideas?
> >> were any "write i/o whilst rebuilding from degraded" issues fixed by
> >> the above patches?
> >
> >It looks like they were, but I didn't notice at the time.
> >
> >If a write to a block in a stripe happens at exactly the same time as the
> >recovery of a different block in that stripe - and both operations are
> >combined into a single "fix up the stripe parity and write it all out"
> >operation - then the block that needs to be recovered is computed but not
> >written out. oops.
> >
> >The following patch should fix it. Please test and report your results.
> >If they prove the fix I will submit it for the various -stable kernels.
> >It looks like this bug has "always" been present :-(
>
> thanks for the very quick reply!
>
> however, I don't think the patch has solved the problem :-/
> I applied it to 2.6.31.14 and have got several mismatches since on both
> FC and SATA machines.

That's disappointing - I was sure I had found it.
I'm tempted to ask "are you really sure you are running the modified kernel",
but I'm sure you are.

>
> BTW, these tests are actually fairly quick. often <10 kickout/rebuild
> loops, so just a few hours.
>
> in the above when you say "block in a stripe", does that mean the whole
> 128k on that disk (we have --chunk=128) might not have been written, or
> one 512-byte block (or a page)?

'page'. raid5 does everything in one-page (4K) per-device strips.
So the mismatch count - which is measured in sectors - will always be a
multiple of 8.

> we don't see mismatch counts of 256 - usually 8 or 16, but I can see
> why our current testing might hide such a count.
> <later> ok - now I blank (dd zeros over) the replacement disk before
> putting it back into the raid and am seeing perhaps slightly larger
> typical mismatch counts of 16 and 32, but so far not 128k of mismatches.
>
> another (potential) data point is that often the position of the mismatches
> on the md device doesn't really line up with where I think the data is
> being written to. the mismatch is often near the start of the md device,
> but sometimes 50% of the way in, and sometimes 95%.
> the filesystem is <20% full, although the wildcard is that I really
> have no idea how the (sparse) files doing the 4k direct i/o are
> allocated across the filesystem, or where fs metadata and journals might
> be updating blocks either.
> seems odd though...

I suspect the filesystem spreads files across the whole disk, though it
depends a lot on the details of the particular filesystem.

>
> >Thanks for the report .... and for all that testing! A git-bisect where each
> >run can take 36 hours is a real test of commitment!!!
>
> :-) no worries. thanks for md and for the help!
> we've been trying to figure this out for months so a bit of testing
> isn't a problem. we first eliminated a bunch of other things
> (filesystem, drivers, firmware, ...) as possibilities. for a long time
> we didn't really believe the problem could be with md as it's so well
> tested around the world and has been very solid for us except for these
> rebuilds.
>

I'll try staring at the code a bit longer and see if anything jumps out at me
(a simplified sketch of the suspected race is appended below).

thanks,
NeilBrown
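A minimal, self-contained C sketch of the race described in the quoted mail
above - this is not the real drivers/md/raid5.c logic, and all names in it
(stripe_dev, WANT_COMPUTE, WANT_WRITE, NDEVS) are invented for the sketch.
It models a write to one block arriving while a different block in the same
stripe still needs recovery, with both folded into a single "compute, update
parity, write it out" pass; the recovered block is computed in memory but
never flagged for write-out, so the replacement disk keeps stale data and a
later check reports mismatches in multiples of 8 sectors (one 4K page is
8 x 512-byte sectors).

/*
 * Toy model only; not drivers/md/raid5.c.  The struct and flag names
 * are invented for this sketch.
 */
#include <stdio.h>

#define NDEVS        4            /* 2 data + P + Q, one 4K page each   */
#define PAGE_SECTORS (4096 / 512) /* 8: why mismatch_cnt moves in 8s    */

enum { WANT_COMPUTE = 1, WANT_WRITE = 2 };

struct stripe_dev { int flags; };

int main(void)
{
	struct stripe_dev dev[NDEVS] = { { 0 } };

	/*
	 * dev[1] sits on the freshly added replacement disk and must be
	 * reconstructed; meanwhile a new write arrives for dev[0], and
	 * both jobs are handled in the same stripe pass.
	 */
	dev[0].flags |= WANT_WRITE;   /* new data from the filesystem     */
	dev[1].flags |= WANT_COMPUTE; /* rebuild from the surviving blocks */
	dev[2].flags |= WANT_WRITE;   /* P recalculated and written       */
	dev[3].flags |= WANT_WRITE;   /* Q recalculated and written       */
	/*
	 * The suspected bug: nothing ever sets WANT_WRITE on dev[1], so
	 * the block reconstructed in the stripe cache never reaches the
	 * disk.  In this toy model the problem disappears if the line
	 * below is uncommented.
	 */
	/* dev[1].flags |= WANT_WRITE; */

	for (int i = 0; i < NDEVS; i++) {
		if (dev[i].flags & WANT_WRITE)
			printf("dev %d: %d sectors written to disk\n",
			       i, PAGE_SECTORS);
		else if (dev[i].flags & WANT_COMPUTE)
			printf("dev %d: computed in memory only -> stale on "
			       "disk, a later 'check' counts %d mismatched "
			       "sectors\n", i, PAGE_SECTORS);
	}
	return 0;
}

Compiled and run as-is, it prints one "computed in memory only" line for the
rebuilt block, which matches the signature reported above: small multiples of
8 mismatched sectors, always on the replacement disk.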