Re: High mismatch count on root device - how to best handle?

Mark Knecht <markknecht@xxxxxxxxx> · Wed, 27 Apr 2011 17:38:29 -0700

On Tue, Apr 26, 2011 at 12:38 PM, Phil Turmel <philip@xxxxxxxxxx> wrote:
> Hi Mark,
>
> On 04/26/2011 01:22 PM, Mark Knecht wrote:
>> On Mon, Apr 25, 2011 at 6:30 PM, Mark Knecht <markknecht@xxxxxxxxx> wrote:
> [trim /]
>
>> OK, I don't know exactly what problem I'm looking for here. I ran
>> the repair, then rebooted. Mismatch count was zero. It seemed the
>> repair had worked.
>>
>> I then used the system for about 4 hours. After 4 hours I did another
>> check and found the mismatch count had increased.
>>
>> What I need to get a handle on is:
>>
>> 1) Is this serious? (I assume yes)
>
> Maybe. Are you using a file in this filesystem as swap in lieu of a dedicated swap partition?
>

No, swap is on 3 partitions, one on each of 3 drives. The kernel
manages swap directly; it has nothing to do with RAID other than
sharing a portion of the same drives.

> I vaguely recall reading that certain code paths in the swap logic can abandon queued writes (due to the data no longer being needed by the VM), such that one or more raid members are left inconsistent. Supposedly only affecting mirrored raid, and only for swap files/partitions.
>
> I don't know if this was ever fixed, or even if anyone tried to fix it.
>

md126 is the main 3-drive RAID1 root partition of a Gentoo install.
Kernel is 2.6.38-gentoo-r1 and I'm using mdadm-3.1.4.

Nothing I do with echo repair seems to stick very well. For a few
moments mismatch_cnt will read 0, but as far as I can tell, if I do
another echo check then I get a high mismatch_cnt again.
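
For anyone following along, the check/repair cycle above can be sketched as a
few shell helpers around the md sysfs files. This is only a sketch: the
function names and the SYSFS_ROOT and POLL_SECONDS variables are made up for
illustration, and md126 is simply the array from this thread.

```shell
#!/bin/sh
# Hypothetical helpers around the md sysfs interface (sync_action, mismatch_cnt).
# SYSFS_ROOT is overridable only so the functions can be tried on a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# md_scrub <mdX> check|repair
# Start a scrub pass by writing the action name to sync_action.
md_scrub() {
    scrub_md=$1
    scrub_action=$2
    case "$scrub_action" in
        check|repair) ;;
        *) echo "usage: md_scrub <mdX> check|repair" >&2; return 2 ;;
    esac
    echo "$scrub_action" > "$SYSFS_ROOT/block/$scrub_md/md/sync_action"
}

# md_wait_idle <mdX>
# Poll sync_action until the array reports "idle" again.
md_wait_idle() {
    while [ "$(cat "$SYSFS_ROOT/block/$1/md/sync_action")" != "idle" ]; do
        sleep "${POLL_SECONDS:-10}"
    done
}

# md_mismatch_cnt <mdX>
# Print the mismatch count left behind by the most recent pass.
md_mismatch_cnt() {
    cat "$SYSFS_ROOT/block/$1/md/mismatch_cnt"
}
```

Run as root, a repair followed by a verifying check would then be roughly:
md_scrub md126 repair; md_wait_idle md126; md_scrub md126 check;
md_wait_idle md126; md_mismatch_cnt md126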

One thing I'm wondering about is whether repair even works on a
3-disk RAID1. I've seen threads out there that suggest it doesn't, and
that it may simply be bypassing the actual repair operation.

>> 2) How do I figure out which drive(s) of the 3 is having trouble?
>
> Don't know. Failing drives usually give themselves away with warnings in dmesg, and/or ejection from the array. There's nothing in the kernel or mdadm that'll help here. You'd have to do three-way voting comparison of all blocks on the member partitions.
>
>> 3) If there is a specific drive, what is the process to swap it out?
>
> mdadm /dev/mdX --fail /dev/sdXY
> mdadm /dev/mdX --remove /dev/sdXY
>
> (swap drives)
>
> mdadm /dev/mdX --add /dev/sdZY
>

I will have some additional things to figure out. There are 5 drives
in this box with a mixture of 3-drive RAID1 & 5-drive RAID6 across
them. If I pull a drive then I need to ensure that all four RAIDs are
going to get rebuilt correctly. I suspect they will, but I'll want to
be careful.
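
Since one physical drive carries members of several arrays, it may help to
list the fail/remove commands for every array that drive belongs to before
touching anything. The helper below is a hypothetical sketch that only prints
commands in the form Phil gave. It assumes sdX-style device names and the
usual /proc/mdstat layout, and it runs nothing itself.

```shell
#!/bin/sh
# plan_drive_removal DRIVE [MDSTAT_FILE]   (hypothetical helper, dry run only)
# Prints the mdadm --fail / --remove commands for every partition of DRIVE
# (e.g. "sdc") that appears as a member of an array in /proc/mdstat.
plan_drive_removal() {
    plan_drive=$1
    plan_mdstat=${2:-/proc/mdstat}
    awk -v d="$plan_drive" '
        # Array lines look like: "md126 : active raid1 sdc3[2] sdb3[1] sda3[0]"
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                sub(/\[.*/, "", member)          # "sdc3[2]" -> "sdc3"
                if (member ~ ("^" d "[0-9]+$")) {
                    printf "mdadm /dev/%s --fail /dev/%s\n", $1, member
                    printf "mdadm /dev/%s --remove /dev/%s\n", $1, member
                }
            }
        }
    ' "$plan_mdstat"
}
```

Calling plan_drive_removal sdc (sdc being a placeholder name) gives a
checklist to review by hand. After the physical swap and repartitioning, each
array would still need its matching mdadm --add.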

Still, if I haven't a clue which drive is causing the mismatch then I
cannot know which one to pull.
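
The three-way voting comparison Phil mentions could be sketched as below. It
is only an illustration, not an mdadm feature: vote3 is a hypothetical helper,
cksum (a CRC) stands in for a stronger hash, and it assumes the array is
stopped or read-only and that all three members use the same data offset. The
md superblock area is expected to differ between members, so hits there mean
nothing.

```shell
#!/bin/sh
# vote3 A B C [CHUNK_BYTES]   (hypothetical helper)
# Compares three inputs chunk by chunk and prints one line per chunk where
# they disagree: "<chunk index> <odd one out>" or "<chunk index> all-differ".
# Stops at the end of A. Slow (three dd calls per chunk), but simple.
vote3() {
    v_a=$1
    v_b=$2
    v_c=$3
    v_bs=${4:-1048576}
    v_i=0
    while :; do
        # cksum on stdin prints "<crc> <byte count>"
        v_sa=$(dd if="$v_a" bs="$v_bs" skip="$v_i" count=1 2>/dev/null | cksum)
        if [ "${v_sa#* }" = "0" ]; then
            break                      # zero bytes read: end of the first input
        fi
        v_sb=$(dd if="$v_b" bs="$v_bs" skip="$v_i" count=1 2>/dev/null | cksum)
        v_sc=$(dd if="$v_c" bs="$v_bs" skip="$v_i" count=1 2>/dev/null | cksum)
        if [ "$v_sa" = "$v_sb" ] && [ "$v_sa" = "$v_sc" ]; then
            :                          # all three agree
        elif [ "$v_sb" = "$v_sc" ]; then
            echo "$v_i $v_a"           # A is outvoted
        elif [ "$v_sa" = "$v_sc" ]; then
            echo "$v_i $v_b"           # B is outvoted
        elif [ "$v_sa" = "$v_sb" ]; then
            echo "$v_i $v_c"           # C is outvoted
        else
            echo "$v_i all-differ"
        fi
        v_i=$((v_i + 1))
    done
}
```

Something like vote3 /dev/sda3 /dev/sdb3 /dev/sdc3 (placeholder names) would
then show whether one member is consistently the odd one out, or whether the
mismatches are spread across all three.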

Thanks for your inputs!

Cheers,
Mark