Kindly follow some mailing list etiquette. Do not jump into unrelated
threads without knowing what is being discussed.

On Thu, Dec 3, 2009 at 2:41 PM, CoolCold <coolthecold@xxxxxxxxx> wrote:
> On Wed, Dec 2, 2009 at 6:38 PM, <greg@xxxxxxxxxxxx> wrote:
>> On Nov 30, 2:08pm, Farkas Levente wrote:
>> } Subject: /etc/cron.weekly/99-raid-check
>>
>>> hi,
>>
>> Hi Farkas, hope your day is going well. Just thought I would respond
>> for the edification of others who are troubled by this issue.
>>
>>> it's been a few weeks since RHEL/CentOS 5.4 was released and there has
>>> been much discussion about this new "feature", the weekly RAID
>>> partition check. we've got lots of servers with RAID1 systems and I
>>> have already tried to configure them not to send these messages, but
>>> I'm not able to, i.e. I've already added all of my swap partitions to
>>> SKIP_DEVS (since I read on the linux-kernel list that swap can show a
>>> non-zero mismatch_cnt, even though I still don't understand why). but
>>> even the data partitions (i.e. all RAID1 partitions on all of my
>>> servers) produce this error (i.e. their mismatch_cnt is never 0 at the
>>> weekend), and this causes all of my RAID1 partitions to be rebuilt
>>> during the weekend. and I don't like it :-(
>>> so my questions:
>>> - is it a real bug in the RAID1 system?
>>> - is it a real bug in my disks running the RAID (I don't really
>>>   believe so, since it's dozens of servers)?
>>> - is /etc/cron.weekly/99-raid-check itself wrong in RHEL/CentOS 5.4?
>>> or what's the problem? can someone enlighten me?
>>
>> It's a combination of what I would consider a misfeature with what MAY
>> be, and I stress MAY be, a genuine bug someplace.
>>
>> The current RAID/IO stack does not 'pin' pages which are destined to
>> be written out to disk. As a result the contents of the pages may
>> change while the request to do I/O against those pages transits the
>> I/O stack down to disk.
>
> Can you write a bit more about "the pages may change"? 'Who' can
> change page contents?
>
>> This results in a 'race' condition where one side of a RAID1 mirror
>> gets one version of the data written to it while the other side of the
>> mirror gets a different version. In the case of a swap partition this
>> appears to be harmless. In the case of filesystems there seems to be a
>> general assurance that this occurs only in uninhabited portions of the
>> filesystem.
>>
>> The 'check' feature of the MD system, which 99-raid-check uses, reads
>> the underlying physical devices of a composite RAID device. The
>> mismatch_cnt is elevated if the contents of mirrored sectors are not
>> identical.
>>
>> The intersection of all this is problematic now that major
>> distributions have included this raid-check feature. There are
>> probably hundreds if not thousands of systems reporting what may or
>> may not be false positives with respect to data corruption.
>>
>> The current RAID stack has an option to 'repair' a RAID set which has
>> mismatches. Unfortunately there is no intelligence in this facility:
>> it arbitrarily picks one of the sectors as being 'good' and uses that
>> to replace the contents of the other sector. I'm somewhat reluctant to
>> recommend the use of this facility given the issues at hand.
>>
>> A complicating factor is that the kernel does not report the locations
>> where the mismatches occur. There appears to be movement underway to
>> include support in the kernel for printing out the sector locations of
>> the mismatches.
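>>
>> For anyone who wants to poke at this by hand in the meantime, here is
>> a minimal sketch using the same md sysfs interface the weekly script
>> drives (md0 is a placeholder device name; run as root):
>>
>>     # Kick off a scrub pass on one array.
>>     echo check > /sys/block/md0/md/sync_action
>>     # sync_action reads back the running action ('idle' when done).
>>     while grep -q check /sys/block/md0/md/sync_action; do sleep 60; done
>>     # A non-zero count here is what the cron job complains about.
>>     cat /sys/block/md0/md/mismatch_cnt
>>     # 'repair' rewrites mismatched regions, but as noted above it
>>     # picks the 'good' copy with no intelligence:
>>     # echo repair > /sys/block/md0/md/sync_action
>>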
>> When that feature becomes available there will be a need for some
>> type of tool, in the case of RAID1 devices backing filesystems, to
>> assess which version of the data is 'correct' so that the faulty
>> version can be over-written with the correct one.
>>
>> As an aside, what is really needed is a tool which assesses whether or
>> not the mismatched sectors are actually in an inhabited portion of the
>> filesystem (a rough sketch of such a check follows at the end of this
>> message). If they are not, the 'repair' facility on RAID1 could
>> presumably be run with no issues, given appropriate
>> coherency/validation checks to guard against the race where an
>> uninhabited portion of the filesystem becomes inhabited in the
>> meantime.
>>
>> We see the issue over a large range of production systems running
>> standard RHEL5 kernels all the way up to recent versions of Fedora.
>> Interestingly, the mismatch counts are always an exact multiple of 128
>> on all of the systems.
>>
>> We have also isolated the problem to RAID1, independent of the backing
>> store. We run geographic mirrors where an initiator is fed from two
>> separate data centers and each mirror half is backed by a RAID5 Linux
>> target. On RAID1 mirrors which show mismatches, both of the separate
>> RAID5 backing volumes report themselves completely consistent.
>>
>> So there is the situation as I believe it currently stands.
>>
>> The notion of running the 'check' sync_action is well founded. The
>> issue of 'silent' data corruption is well understood. As of a couple
>> of years ago the Linux RAID system will re-write any sectors which
>> come up as unreadable during the check process, and the disk drive
>> will then re-allocate a sector from its re-mapping pool, effectively
>> replacing the bad sector. This pays huge dividends with respect to
>> maintaining healthy RAID farms.
>>
>> Unfortunately, the reporting of mismatch_cnt is problematic given the
>> issues above. I think it is unfortunate that the vendors opted to ship
>> this checking/reporting while these issues are still unresolved.
>>
>>> thanks in advance.
>>> regards.
>>>
>>> --
>>> Levente             "Si vis pacem para bellum!"
>>
>> Hope the above information is helpful for everyone running into this
>> issue.
>>
>> Best wishes for a productive remainder of the week to everyone.
>>
>> Greg
>>
>> }-- End of excerpt from Farkas Levente
>>
>> As always,
>> Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
>> 4206 N. 19th Ave.           Specializing in information infra-structure
>> Fargo, ND 58102             development.
>> PH: 701-281-1686
>> FAX: 701-281-3949           EMAIL: greg@xxxxxxxxxxxx
>> ------------------------------------------------------------------------------
>> "Experience is something you don't get until just after you need it."
>>     -- Olivier
>
> --
> Best regards,
> [COOLCOLD-RIPN]

--
Sujit K M
blog(http://kmsujit.blogspot.com/)
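
As a rough sketch of the "inhabited portion" assessment described above
(this is only an illustration: it assumes an ext3 filesystem sitting
directly on /dev/md0, takes a 512-byte-sector offset the kernel has
already reported for the mismatch, and ignores filesystem metadata
blocks, which are a separate question):

    # Hypothetical example: sector 262144 of md0 reported as mismatched.
    SECTOR=262144
    # Translate the sector offset into a filesystem block number.
    BLKSZ=$(dumpe2fs -h /dev/md0 2>/dev/null | awk '/^Block size:/ {print $3}')
    FSBLOCK=$(( SECTOR * 512 / BLKSZ ))
    # Ask ext3 which inode, if any, owns that block.  An answer of
    # "<block not found>" means no file claims it, i.e. the mismatch is
    # in uninhabited space and a 'repair' pass cannot clobber live data.
    debugfs -R "icheck $FSBLOCK" /dev/md0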