Re: mismatch_count != 0 on multiple hosts

On Wed, Sep 16, 2009 at 09:20:35PM +0200, Mario 'BitKoenig' Holbe wrote:
> Bryan Mesich <bryan.mesich@xxxxxxxx> wrote:
> > The most popular mismatch_cnt values are 128 and 256.  The worst I
> > found was 21504 and 7168.  I find it interesting that all are
> > divisible by 128.
> 
> They should not appear on RAID5.

I would agree.  The only reason I mentioned RAID5 was to rule out the
possibility that the drives were spontaneously flipping bits.  Our SAN
environment looks like the following:


|     Initiator      |              FC Target                |
--------------------------------------------------------------
                        Block Dev -> LVM -> RAID5 -> Block Dev
ext3 -> LVM -> RAID1 {
                        Block Dev -> LVM -> RAID5 -> Block Dev

Since the RAID1 block devices on the initiator are sitting on an
underlying RAID5 array (on the target), we should not notice the
random bit flips (or other corruption) that a single drive might
exhibit.

This is only one example of many that I have.  I've found mismatches
on other RAID1 arrays that are comprised of SATA, SAS, SCSI and/or SAN 
volumes (as shown in the above example).
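
For reference, I'm collecting the counts by kicking off a check pass
and reading mismatch_cnt from sysfs, roughly like this (md0 is just a
stand-in for whichever array is being checked):

    echo check > /sys/block/md0/md/sync_action
    # wait until sync_action reports "idle" again
    cat /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt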

> What kinds of filesystems reside on the RAID1s? If it's ext[23] (well,
> most likely on SANs it's not :)), then these mismatches are very likely
> located in inode blocks.

Yes, we are running ext3 :).

There is only 1 array that is running something else (ext4).

I'm not sure I follow you on the inode problem.  If there were a
problem, shouldn't it be replicated to both block devices?

> I've noticed this quite often on RAID1s up to 2.6.26, especially on
> filesystems with heavy inode fluctuation (remove, create files).
> Starting with .26 (i guess, maybe later) I didn't see them anymore.
> Maybe yours on newer kernels are just brownfields? Correct them and see
> if they appear again?

The majority of our boxes are running the default RHEL 2.6.18 kernel.
I would be quicker to blame the RHEL kernel (missing
patches/back-ports), but I am seeing this on machines that are running
mainline kernels (as new as 2.6.29).

I'm not sure the problem can be attributed to heavy inode
fluctuation, since one of my worst offending arrays has only 119 files
on it (vmware guests w/pre-allocated disks) and is running ext4.  Even
if there were FS problems, why doesn't the problem get replicated to
both sides of the mirror?

I worry about running a repair for fear that I might tromp over
something.  I was hoping that future writes might clear up some of the
mismatches?
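
If I do end up running one, my understanding is that it would look
roughly like this (md0 again being a stand-in), with a second check
pass afterwards to see whether the mismatches come back:

    echo repair > /sys/block/md0/md/sync_action
    # wait for it to go idle, then re-run the check
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt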

> > So, is there anyway to get an output on which blocks do not match?  I'd
> > like to see how they are different, if at all.
> 
> just cmp -l the components, as you did before md brought up the
> sync_action check target :)

Thanks for the tip :)
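
For the archives, the comparison I have in mind looks something like
this (sda2 and sdb2 are just placeholders for the two RAID1
components; cmp -l prints the byte offset and the differing byte
values, and head keeps the output manageable):

    cmp -l /dev/sda2 /dev/sdb2 | head -n 20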

Bryan


