On Wed, Dec 2, 2009 at 6:38 PM,  <greg@xxxxxxxxxxxx> wrote:
> On Nov 30, 2:08pm, Farkas Levente wrote:
> } Subject: /etc/cron.weekly/99-raid-check
>
>> hi,
>
> Hi Farkas, hope your day is going well. Just thought I would respond
> for the edification of others who are troubled by this issue.
>
>> it's been a few weeks since rhel/centos 5.4 was released, and there
>> has been much discussion about its new "feature", the weekly raid
>> partition check. we've got lots of servers with raid1 systems, and i
>> have already tried to configure them not to send these messages, but
>> without success. i have already added all of my swap partitions to
>> SKIP_DEVS (since i read on the linux-kernel list that swap can show a
>> non-zero mismatch_cnt, though i still don't understand why). but even
>> the data partitions (i.e. all raid1 partitions on all of my servers)
>> produce this error (i.e. their mismatch_cnt is never 0 at the
>> weekend), and as a result all of my raid1 partitions are rebuilt
>> during the weekend. i don't like it :-(
>> so my questions:
>> - is it a real bug in the raid1 system?
>> - is it a real bug in my disks running raid (not likely, since it
>>   happens on dozens of servers)?
>> - is /etc/cron.weekly/99-raid-check wrong in rhel/centos-5.4?
>> or what's the problem? can someone enlighten me?
>
> It's a combination of what I would consider a misfeature with what MAY
> BE, and I stress MAY be, a latent bug someplace.
>
> The current RAID/IO stack does not 'pin' pages which are destined to
> be written out to disk. As a result, the contents of the pages may
> change while the request to do I/O against them transits the I/O
> stack down to the disk.

Can you write a bit more about "the pages may change"? Who can change
the page contents?

> This results in a race condition where one side of a RAID1 mirror
> gets one version of the data written to it while the other side of
> the mirror gets a different version. In the case of a swap partition
> this appears to be harmless.
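[For reference, the knob Farkas mentions lives in /etc/sysconfig/raid-check
on RHEL/CentOS 5.4. A minimal sketch of the relevant settings, assuming the
variable names used by the stock raid-check script; md2 and md3 are
placeholder device names, not anything from Farkas's setup:]

```shell
# /etc/sysconfig/raid-check -- sketch only; md2 and md3 are placeholders.
ENABLED=yes
CHECK=check            # "check" scrubs read-only; "repair" rewrites mismatches
# Arrays backing swap, where a non-zero mismatch_cnt appears to be expected:
SKIP_DEVS="md2 md3"
```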
> In the case of filesystems, there seems to be a general assurance
> that this occurs only in uninhabited portions of the filesystem.
>
> The 'check' feature of the MD system, which 99-raid-check uses, reads
> the underlying physical devices of a composite RAID device. The
> mismatch_cnt is elevated if the contents of mirrored sectors are not
> identical.
>
> The results of the intersection of all this are problematic now that
> major distributions have included this raid-check feature. There are
> probably hundreds if not thousands of systems which are reporting
> what may or may not be false positives with respect to data
> corruption.
>
> The current RAID stack has an option to 'repair' a RAID set which has
> mismatches. Unfortunately there is no intelligence in this facility;
> it arbitrarily picks one of the sectors as being 'good' and uses that
> to replace the contents of the other sector. I'm somewhat reluctant
> to recommend the use of this facility given the issues at hand.
>
> A complicating factor is that the kernel does not report the
> locations where the mismatches occur. There appears to be movement
> underway to include support in the kernel for printing out the sector
> locations of the mismatches.
>
> When that feature becomes available there will be a need for some
> type of tool, in the case of RAID1 devices backing filesystems, to
> assess which version of the data is 'correct' so that the faulty
> version can be overwritten with the correct one.
>
> As an aside, what is really needed is a tool which assesses whether
> or not the mismatched sectors are actually in an inhabited portion of
> the filesystem. If not, the 'repair' facility on RAID1 could
> presumably be run with no issues, given appropriate
> coherency/validation checks to make sure the sectors are not still
> incoherent secondary to a race where the uninhabited portion chooses
> to become inhabited.
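[The 'check' and 'repair' actions described above are driven through the md
sysfs interface. A minimal sketch, assuming the standard
/sys/block/<md>/md layout with its sync_action and mismatch_cnt attributes;
this is not the 99-raid-check script itself, just the mechanism it wraps:]

```shell
# Scrub one array and report its mismatch count. Assumes the standard
# md sysfs attributes sync_action and mismatch_cnt; returns non-zero
# if the named device is not an md array.
check_md() {
    md=$1
    sys=/sys/block/$md/md
    if [ ! -d "$sys" ]; then
        echo "$md: not an md device, skipping"
        return 1
    fi
    echo check > "$sys/sync_action"              # start a read-only scrub
    while [ "$(cat "$sys/sync_action")" != idle ]; do
        sleep 10                                 # poll until the scrub finishes
    done
    cat "$sys/mismatch_cnt"                      # sectors that differed
}
```

[Writing "repair" instead of "check" invokes the arbitrary-winner resync
discussed above, so holding off on that until the sector-location reporting
lands seems prudent.]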
> We see the issue over a large range of production systems running
> standard RHEL5 kernels all the way up to recent versions of Fedora.
> Interestingly, the mismatch counts are always an exact multiple of
> 128 on all the systems.
>
> We have also isolated the problem to RAID1, independent of the
> backing store. We run geographic mirrors where an initiator is fed
> from two separate data-centers, with each mirror half based on a
> RAID5 Linux target. On RAID1 mirrors which are mismatched, the two
> separate RAID5 backing volumes both report completely consistent
> volumes.
>
> So there is the situation as I believe it currently stands.
>
> The notion of running the 'check' sync_action is well founded. The
> issue of 'silent' data corruption is well understood and well
> founded. The Linux RAID system, as of a couple of years ago, will
> re-write any sectors which come up as unreadable during the check
> process. Disk drives will then re-allocate such a sector from their
> re-mapping pool, effectively replacing the bad sector. This pays huge
> dividends with respect to maintaining healthy RAID farms.
>
> Unfortunately, the reporting of the mismatch_cnt's is problematic
> given the above issues. I think it is unfortunate that the vendors
> opted to release this checking/reporting while these issues are still
> unresolved.
>
>> thanks in advance.
>> regards.
>>
>> --
>> Levente                           "Si vis pacem para bellum!"
>
> Hope the above information is helpful for everyone running into this
> issue.
>
> Best wishes for a productive remainder of the week to everyone.
>
> Greg
>
> }-- End of excerpt from Farkas Levente
>
> As always,
> Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
> 4206 N. 19th Ave.           Specializing in information infra-structure
> Fargo, ND 58102             development.
> PH: 701-281-1686
> FAX: 701-281-3949           EMAIL: greg@xxxxxxxxxxxx
> ------------------------------------------------------------------------------
> "Experience is something you don't get until just after you need it."
>                                     -- Olivier
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Best regards,
[COOLCOLD-RIPN]
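P.S. A quick way to survey Greg's multiple-of-128 observation across a
box's arrays. A sketch only, assuming mismatch_cnt is reported in 512-byte
sectors (so 128 sectors = 64 KiB); the sysfs root is a parameter purely so
the function can be pointed at a fake tree:

```shell
# Report each array's mismatch_cnt and whether it is a multiple of 128.
# The sysfs root defaults to /sys/block but can be overridden so the
# function can be exercised against a fabricated directory tree.
scan_mismatches() {
    base=${1:-/sys/block}
    for f in "$base"/md*/md/mismatch_cnt; do
        [ -e "$f" ] || continue                        # no arrays present
        cnt=$(cat "$f")
        md=$(basename "$(dirname "$(dirname "$f")")")  # e.g. md0
        if [ $((cnt % 128)) -eq 0 ]; then
            echo "$md: $cnt (multiple of 128)"
        else
            echo "$md: $cnt (NOT a multiple of 128)"
        fi
    done
}
```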