RE: Why does one get mismatches?

Jon Hardcastle <jd_hardcastle@xxxxxxxxx> · Thu, 28 Jan 2010 01:16:28 -0800 (PST)

> -----Original Message-----
> From: linux-raid-owner@xxxxxxxxxxxxxxx [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of Steven Haigh
> Sent: Monday, January 25, 2010 2:49 PM
> To: linux-raid@xxxxxxxxxxxxxxx
> Subject: Re: Why does one get mismatches?
> 
> 
> On 26/01/2010, at 7:43 AM, greg@xxxxxxxxxxxx wrote:
> 
> > On Jan 21, 12:48pm, Farkas Levente wrote:
> > } Subject: Re: Why does one get mismatches?
> > 
> > Good afternoon to everyone, hope the week is starting well.
> > 
> >> On 01/21/2010 11:52 AM, Steven Haigh wrote:
> >>> On Thu, 21 Jan 2010 09:08:42 +0100, Asdo<asdo@xxxxxxxxxxxxx> wrote:
> >>>> Steven Haigh wrote:
> >>>>> On Wed, 20 Jan 2010 17:43:45 -0500, Brett Russ<bruss@xxxxxxxxxxx> wrote:
> >>>>> 
> >>>>> CUT!
> >>>> Might that be a problem of the disks/controllers?
> >>>> Jon and Steven, what hardware do you have?
> >>> 
> >>> I'm running some fairly old hardware on this particular server. It's a
> >>> dual P3 1Ghz.
> >>> 
> >>> After running a repair on /dev/md2, I now see:
> >>> # cat /sys/block/md2/md/mismatch_cnt
> >>> 1536
> >>> 
> >>> Again, no smart errors, nothing to indicate a disk problem at all :(
> >>> 
> >>> As this really keeps killing the machine and it is a live system - the
> >>> only thing I can really think of doing is to break the RAID and just rsync
> >>> the drives twice daily :\
> > 
> >> the same happened with many people. and we all hate it since it
> >> causes a huge load all weekend on most of our servers:-( according
> >> to redhat it's not a bug:-(
> > 
> > The RAID check/mismatch_count is an example of well intentioned
> > technology suffering from 'featuritis' by the distributions which is,
> > as I predicted a couple of times in this forum, causing all sorts of
> > angst and problems throughout the world.  I've had some posts on this
> > subject but will summarize in the hopes of giving some background
> > information which will be useful to people.
> > 
> > There is an issue in the kernel which causes these mismatches.  The
> > problem seems to be particularly bad with RAID1 arrays.  The
> > contention is that these mismatches are 'harmless' because they only
> > occur in areas of the filesystems which are not being used.
> > 
> > The best description is that the buffers containing the data to be
> > written are not 'pinned' all the way down the I/O stack.  This can
> > cause the contents of a buffer to be changed while in transit through
> > the I/O stack.  Thus one copy of a mirror gets a buffer written to it
> > different than the other side of the mirror.
> > 
> > I've read reasoned discussions about why this occurs with swap over
> > RAID1 and why it's harmless.  I've yet to see the same type of reasoned
> > discussion as to why it is not problematic with a filesystem over
> > RAID1.  There has been some discussion that it's due to high levels of
> > MMAP activity on the filesystem.
> > 
> > We have confirmed that, at least with RAID1, this all occurs with no
> > physical corruption on the 'disk drives'.  We implement geographically
> > mirrored storage with RAID1 against two separate data-centers.  At each
> > data-center the RAID1 'block-devices' are RAID5 volumes.  These latter
> > volumes check out with no errors/mismatch counts etc.  So the issue is
> > at the RAID1 data abstraction layer.
> > 
> > There do not appear to be any tools which allow one to determine
> > 'where' the mismatches are.  Such a tool, or logging by the kernel,
> > would be useful for people who want to verify what files, if any, are
> > affected by the mismatch.  Otherwise running a 'repair' results in the
> > RAID1 code arbitrarily deciding which of the two blocks is the
> > 'correct' one.
> > 
> > So that's sort of a thumbnail sketch of what is going on.  The fact the
> > distributions chose to implement this without understanding the issues
> > it presents is a bit problematic.
> > 
> >>   Levente                               "Si vis pacem para bellum!"
> > 
> > Hopefully this information is helpful.
> > 
> > Greg
> 
> Hi Greg and all,
> 
> The funny part is that I believe the mismatches aren't happening in the
> empty space of the filesystem - as it seems that the errors are causing
> the ext3 journal to abort and force the filesystem into readonly in my
> particular situation.
> 
> It is interesting that I do not get any mismatches on md0, md1 or md3 -
> only md2.
> 
> md0 = /boot
> md1 = swap
> md2 = /
> md3 = /tmp
> 
> I ran weekly checks on all four RAID1 arrays and ONLY md2 had a
> problem with mismatches, which also had a habit of going readonly -
> therefore I don't believe the part of common belief that this problem
> only affects empty parts of the filesystem.
> 
> I have also done just about every test to the disks that I can think of
> with no errors to be found - leaving only the md layer to be suspect.
> 
> --
> Steven Haigh
> 
> Email: netwiz@xxxxxxxxx
> Web: http://www.crc.id.au
> Phone: (03) 9001 6090 - 0412 935 897
> Fax: (03) 8338 0299
> 

Well, I finished running my non-destructive badblocks check and ran several SMART long tests. I also did a forced fsck on the bad boy, and NOW the active md4 (with a DEACTIVATED VG on it) returns a mismatch_cnt of 0. I haven't rebooted it in days, though, so I just don't know what caused this. There are no errors in the log, and the pending/reallocated sector count is still 0 on all drives.
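
For anyone who wants to watch these counters across all arrays at once rather than cat-ing each file, here is a minimal Python sketch. It only assumes the /sys/block/mdX/md/mismatch_cnt layout quoted earlier in this thread; the function name and the sys_block parameter are my own, made up for illustration.

```python
from pathlib import Path


def read_mismatch_counts(sys_block="/sys/block"):
    """Return {array_name: mismatch_cnt} for every md array under sys_block.

    sys_block is a hypothetical parameter so the function can be pointed
    at a test directory instead of the real sysfs tree.
    """
    counts = {}
    for entry in sorted(Path(sys_block).glob("md*")):
        counter = entry / "md" / "mismatch_cnt"
        # Skip anything that matches md* but has no mismatch counter.
        if counter.is_file():
            counts[entry.name] = int(counter.read_text().strip())
    return counts


if __name__ == "__main__":
    for name, count in read_mismatch_counts().items():
        print(f"{name}: {count}")
```

The counter only reflects the most recent check or repair pass, so it is worth logging the value after each run rather than reading it at arbitrary times.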

I have reactivated my VG and am running it again now. It is just bizarre.
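
On Greg's point that nothing reports 'where' the mismatches are: a rough way to find out would be to compare the two RAID1 members chunk by chunk and print the offsets that differ. The Python below is only a sketch of that idea, not an existing tool. The function name and chunk size are my own choices, and it assumes the array is stopped or otherwise idle and that both members use the same data offset. The md superblock area is expected to differ between members, so hits there mean nothing.

```python
def find_mismatches(path_a, path_b, chunk_size=64 * 1024):
    """Yield byte offsets of chunks that differ between two files or devices.

    path_a and path_b are the two mirror halves (or image files of them).
    A size difference shows up as a mismatch in the trailing chunk(s).
    """
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if not block_a and not block_b:
                break  # both inputs exhausted
            if block_a != block_b:
                yield offset
            offset += chunk_size
```

Reading the raw members needs root, and on a live, mounted array the result is not trustworthy because writes may be in flight. Mapping a reported offset back to a file would be a further step that depends on the filesystem.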
