On Mon, Apr 25, 2011 at 6:30 PM, Mark Knecht <markknecht@xxxxxxxxx> wrote:
> On Mon, Apr 25, 2011 at 3:32 PM, Mark Knecht <markknecht@xxxxxxxxx> wrote:
>> I did a drive check today, first time in months, and found I have a
>> high mismatch count on my RAID1 root device. What's the best way to
>> handle getting this cleaned up?
>>
>> 1) I'm running some smartctl tests as I write this.
>>
>> 2) Do I just do an
>>
>> echo repair
>>
>> to md126, or do I have to boot a rescue CD before I do that?
>>
>> If you need more info please let me know.
>>
>> Thanks,
>> Mark
>>
>> c2stable ~ # cat /sys/block/md3/md/mismatch_cnt
>> 0
>> c2stable ~ # cat /sys/block/md6/md/mismatch_cnt
>> 0
>> c2stable ~ # cat /sys/block/md7/md/mismatch_cnt
>> 0
>> c2stable ~ # cat /sys/block/md126/md/mismatch_cnt
>> 222336
>> c2stable ~ # df
>> Filesystem      1K-blocks      Used Available Use% Mounted on
>> /dev/md126       51612920  26159408  22831712  54% /
>> udev                10240       432      9808   5% /dev
>> /dev/md7        389183252 144979184 224434676  40% /VirtualMachines
>> shm               6151452         0   6151452   0% /dev/shm
>> c2stable ~ # cat /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
>> md6 : active raid1 sdc6[2] sda6[0] sdb6[1]
>>       247416933 blocks super 1.1 [3/3] [UUU]
>>
>> md7 : active raid6 sdc7[2] sda7[0] sdb7[1] sdd2[3] sde2[4]
>>       395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]
>>
>> md3 : active raid6 sdc3[2] sda3[0] sdb3[1] sdd3[3] sde3[4]
>>       157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]
>>
>> md126 : active raid1 sdc5[2] sda5[0] sdb5[1]
>>       52436032 blocks [3/3] [UUU]
>>
>> unused devices: <none>
>> c2stable ~ #
>
> The smartctl tests that I ran (long) completed without error on all 5
> drives in the system:
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%       2887        -
> # 2  Extended offline    Completed without error       00%       2046        -
>
> So, if I understand correctly, the next step I'd do would be something like
>
> echo repair >/sys/block/md126/md/sync_action
>
> but I'm unclear about the need to do this when mdadm seems to think
> the RAID is clean:
>
> c2stable ~ # mdadm -D /dev/md126
> /dev/md126:
>         Version : 0.90
>   Creation Time : Tue Apr 13 09:02:34 2010
>      Raid Level : raid1
>      Array Size : 52436032 (50.01 GiB 53.69 GB)
>   Used Dev Size : 52436032 (50.01 GiB 53.69 GB)
>    Raid Devices : 3
>   Total Devices : 3
> Preferred Minor : 126
>     Persistence : Superblock is persistent
>
>     Update Time : Mon Apr 25 18:29:39 2011
>           State : clean
>  Active Devices : 3
> Working Devices : 3
>  Failed Devices : 0
>   Spare Devices : 0
>
>            UUID : edb0ed65:6e87b20e:dc0d88ba:780ef6a3
>          Events : 0.248880
>
>     Number   Major   Minor   RaidDevice State
>        0       8        5        0      active sync   /dev/sda5
>        1       8       21        1      active sync   /dev/sdb5
>        2       8       37        2      active sync   /dev/sdc5
> c2stable ~ #
>
> Thanks in advance.
>
> Cheers,
> Mark

OK, I don't know exactly what sort of problem I'm looking at here. I ran the
repair, then rebooted; the mismatch count was zero, so it seemed the repair
had worked. I then used the system for about 4 hours, did another check, and
found the mismatch count had increased again.
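For reference, the cycle I've been running by hand boils down to roughly the
following (just a sketch -- md126 is my RAID1 root, and the idle-polling loop
is only how I'd script it, not something I've actually automated):

#!/bin/sh
# Run a read-and-compare pass over the RAID1 root array and report the result.
MD=md126

echo check > /sys/block/$MD/md/sync_action      # read all mirrors and compare

# Wait for the pass to finish; sync_action reads "idle" again when it's done.
while [ "$(cat /sys/block/$MD/md/sync_action)" != "idle" ]; do
    sleep 60
done

cat /sys/block/$MD/md/mismatch_cnt              # non-zero means the copies differ

# To actually rewrite the inconsistent blocks, use "repair" instead of "check":
#   echo repair > /sys/block/$MD/md/sync_action
# (As I understand it, on RAID1 this just picks one copy and rewrites the
#  others from it; it has no way of knowing which copy is "right".)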
What I need to get a handle on is:

1) Is this serious? (I assume yes.)

2) How do I figure out which of the three drives is having trouble?
   (I've put a rough idea at the bottom of this mail.)

3) If there is a specific drive, what is the process to swap it out?

Thanks,
Mark

c2stable ~ # cat /sys/block/md126/md/mismatch_cnt
0
c2stable ~ # echo check >/sys/block/md126/md/sync_action
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdb6[1] sdc6[2] sda6[0]
      247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdb7[1] sdc7[2] sde2[4] sda7[0] sdd2[3]
      395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdb3[1] sdc3[2] sda3[0] sdd3[3] sde3[4]
      157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sdb5[1] sda5[0]
      52436032 blocks [3/3] [UUU]
      [>....................]  check =  1.1% (626560/52436032) finish=11.0min speed=78320K/sec

unused devices: <none>
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdb6[1] sdc6[2] sda6[0]
      247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdb7[1] sdc7[2] sde2[4] sda7[0] sdd2[3]
      395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdb3[1] sdc3[2] sda3[0] sdd3[3] sde3[4]
      157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sdb5[1] sda5[0]
      52436032 blocks [3/3] [UUU]
      [===========>.........]  check = 59.6% (31291776/52436032) finish=5.5min speed=63887K/sec

unused devices: <none>
c2stable ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md6 : active raid1 sdb6[1] sdc6[2] sda6[0]
      247416933 blocks super 1.1 [3/3] [UUU]

md7 : active raid6 sdb7[1] sdc7[2] sde2[4] sda7[0] sdd2[3]
      395387904 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid6 sdb3[1] sdc3[2] sda3[0] sdd3[3] sde3[4]
      157305168 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

md126 : active raid1 sdc5[2] sdb5[1] sda5[0]
      52436032 blocks [3/3] [UUU]

unused devices: <none>
c2stable ~ # cat /sys/block/md126/md/mismatch_cnt
7424
c2stable ~ #
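Regarding question 2, one idea I had for narrowing down which member is
diverging (no idea whether this is the recommended approach; sda5/sdb5/sdc5
are simply the members of my md126, and the 1 GiB sample size is arbitrary)
is to checksum the same slice of each raw member while the box is as quiet
as possible, and see which of the three disagrees with the other two:

#!/bin/sh
# Checksum the first 1 GiB of each RAID1 member of md126.  With RAID1 the
# data layout is identical on every member, so the sums should agree if the
# mirrors are in sync; the odd one out would be the suspect drive.  Only
# meaningful while the filesystem is quiet (ideally mounted read-only).
for dev in /dev/sda5 /dev/sdb5 /dev/sdc5; do
    sum=$(dd if="$dev" bs=1M count=1024 2>/dev/null | md5sum | cut -d' ' -f1)
    echo "$dev  $sum"
done

Corrections welcome if that's the wrong way to go about it.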