On Thu, 7 Apr 2011 03:45:05 -0400 Robin Humble <robin.humble+raid@xxxxxxxxxx> wrote:

> On Tue, Apr 05, 2011 at 03:00:22PM +1000, NeilBrown wrote:
> >On Mon, 4 Apr 2011 09:59:02 -0400 Robin Humble <robin.humble+raid@xxxxxxxxxx>
> >> we are finding non-zero mismatch_cnt's and getting data corruption when
> >> using RHEL5/CentOS5 kernels with md raid6.
> >> actually, all kernels prior to 2.6.32 seem to have the bug.
> >>
> >> the corruption only happens after we replace a failed disk, and the
> >> incorrect data is always on the replacement disk. i.e. the problem is
> >> with rebuild. mismatch_cnt is always a multiple of 8, so I suspect
> >> pages are going astray.
> ...
> >> git bisecting through drivers/md/raid5.c between 2.6.31 (has mismatches)
> >> and .32 (no problems) says that one of these (unbisectable) commits
> >> fixed the issue:
> >>   a9b39a741a7e3b262b9f51fefb68e17b32756999 md/raid6: asynchronous handle_stripe_dirtying6
> >>   5599becca4bee7badf605e41fd5bcde76d51f2a4 md/raid6: asynchronous handle_stripe_fill6
> >>   d82dfee0ad8f240fef1b28e2258891c07da57367 md/raid6: asynchronous handle_parity_check6
> >>   6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 md/raid6: asynchronous handle_stripe6
> >>
> >> any ideas?
> >> were any "write i/o whilst rebuilding from degraded" issues fixed by
> >> the above patches?
> >
> >It looks like they were, but I didn't notice at the time.
> >
> >If a write to a block in a stripe happens at exactly the same time as the
> >recovery of a different block in that stripe - and both operations are
> >combined into a single "fix up the stripe parity and write it all out"
> >operation - then the block that needs to be recovered is computed but not
> >written out. oops.
> >
> >The following patch should fix it. Please test and report your results.
> >If they prove the fix I will submit it for the various -stable kernels.
> >It looks like this bug has "always" been present :-(
>
> thanks for the very quick reply!
>
> however, I don't think the patch has solved the problem :-/
> I applied it to 2.6.31.14 and have got several mismatches since on both
> FC and SATA machines.

That's disappointing - I was sure I had found it.
I'm tempted to ask "are you really sure you are running the modified kernel",
but I'm sure you are.

>
> BTW, these tests are actually fairly quick. often <10 kickout/rebuild
> loops, so just a few hours.
>
> in the above when you say "block in a stripe", does that mean the whole
> 128k on that disk (we have --chunk=128) might not have been written, or
> one 512-byte block (or a page)?

'page'. raid5 does everything in one-page (4K) per-device strips.
So the mismatch count - which is measured in sectors - will always be a
multiple of 8.

> we don't see mismatch counts of 256 - usually 8 or 16, but I can see
> why our current testing might hide such a count.
> <later> ok - now I blank (dd zeros over) the replacement disk before
> putting it back into the raid and am seeing perhaps slightly larger
> typical mismatch counts of 16 and 32, but so far not 128k of mismatches.
>
> another (potential) data point is that often the position of the mismatches
> on the md device doesn't really line up with where I think the data is
> being written to. the mismatch is often near the start of the md device,
> but sometimes 50% of the way in, and sometimes 95%.
> the filesystem is <20% full, although the wildcard is that I really
> have no idea how the (sparse) files doing the 4k direct i/o are
> allocated across the filesystem, or where fs metadata and journals might
> be updating blocks either.
> seems odd though...

I suspect the filesystem spreads files across the whole disk, though it
depends a lot on the details of the particular filesystem.

>
> >Thanks for the report .... and for all that testing! A git-bisect where each
> >run can take 36 hours is a real test of commitment!!!
>
> :-) no worries. thanks for md and for the help!
> we've been trying to figure this out for months so a bit of testing
> isn't a problem. we first eliminated a bunch of other things
> (filesystem, drivers, firmware, ...) as possibilities. for a long time
> we didn't really believe the problem could be with md as it's so well
> tested around the world and has been very solid for us except for these
> rebuilds.
>

I'll try staring at the code a bit longer and see if anything jumps out at me
(a simplified sketch of the suspected race is appended below).

thanks,
NeilBrown
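A minimal, self-contained C sketch of the race described in the quoted mail
above - this is not the real drivers/md/raid5.c logic, and all names in it
(stripe_dev, WANT_COMPUTE, WANT_WRITE, NDEVS) are invented for the sketch.
It models a write to one block arriving while a different block in the same
stripe still needs recovery, with both folded into a single "compute, update
parity, write it out" pass; the recovered block is computed in memory but
never flagged for write-out, so the replacement disk keeps stale data and a
later check reports mismatches in multiples of 8 sectors (one 4K page is
8 x 512-byte sectors).

/*
 * Toy model only; not drivers/md/raid5.c.  The struct and flag names
 * are invented for this sketch.
 */
#include <stdio.h>

#define NDEVS        4            /* 2 data + P + Q, one 4K page each   */
#define PAGE_SECTORS (4096 / 512) /* 8: why mismatch_cnt moves in 8s    */

enum { WANT_COMPUTE = 1, WANT_WRITE = 2 };

struct stripe_dev { int flags; };

int main(void)
{
	struct stripe_dev dev[NDEVS] = { { 0 } };

	/*
	 * dev[1] sits on the freshly added replacement disk and must be
	 * reconstructed; meanwhile a new write arrives for dev[0], and
	 * both jobs are handled in the same stripe pass.
	 */
	dev[0].flags |= WANT_WRITE;   /* new data from the filesystem     */
	dev[1].flags |= WANT_COMPUTE; /* rebuild from the surviving blocks */
	dev[2].flags |= WANT_WRITE;   /* P recalculated and written       */
	dev[3].flags |= WANT_WRITE;   /* Q recalculated and written       */
	/*
	 * The suspected bug: nothing ever sets WANT_WRITE on dev[1], so
	 * the block reconstructed in the stripe cache never reaches the
	 * disk.  In this toy model the problem disappears if the line
	 * below is uncommented.
	 */
	/* dev[1].flags |= WANT_WRITE; */

	for (int i = 0; i < NDEVS; i++) {
		if (dev[i].flags & WANT_WRITE)
			printf("dev %d: %d sectors written to disk\n",
			       i, PAGE_SECTORS);
		else if (dev[i].flags & WANT_COMPUTE)
			printf("dev %d: computed in memory only -> stale on "
			       "disk, a later 'check' counts %d mismatched "
			       "sectors\n", i, PAGE_SECTORS);
	}
	return 0;
}

Compiled and run as-is, it prints one "computed in memory only" line for the
rebuilt block, which matches the signature reported above: small multiples of
8 mismatched sectors, always on the replacement disk.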