Re: mismatch_cnt constantly goes up on ssd+hdd raid1

NeilBrown <neilb@xxxxxxxx> · Thu, 25 Jun 2015 11:33:35 +1000

On Sun, 14 Jun 2015 20:13:16 +0300 tlknv <tlknv@xxxxxxxxx> wrote:

> Hello,

> I have raid 1 which mirrors a root/boot partition on 1SSD and 2HDD
> (write-mostly). mismatch_cnt goes up even when there are very few
> writes to the partition as /var is mounted separately. After I update
> several packages I typically see mismatch_cnt somewhere between
> 500,000 and 2,000,000. I have read a number of threads in this DL
> but could not find an explanation of what could cause mismatch_cnt
> to grow that much. I checked md5 sums using
> /var/lib/dpkg/info/*.md5sums, and didn't see many errors, even
> though there are few, mostly in text files which look ok to me. I
> guess when I check, all reads go to SSD (as both HDDs in this raid
> are write-mostly), and thus md5sum only shows no problem on
> SSD. Note, this partition is used as both boot and root and just in
> case here is some more info about my system:

This does surprise me.

I had another look at the code and there could be a bug that would let
'check' see the difference between when the first write completes and
when the write-behind writes complete, but you would need to run the
check while the install was happening for that to be noticed, and even
then you would need to be unlucky.

What you could try is:
 - add a bitmap (mdadm --grow /dev/md0 --bitmap=internal) so that
   recovery will be fast if you remove then re-add a device.
 - fail and remove one of the HDDs
     mdadm /dev/md0 --fail /dev/sda2
     mdadm /dev/md0 --remove /dev/sda2
 - Find the data offset and use losetup to access the data directly.
    mdadm --examine /dev/sda2 | grep 'Data Offset'
        Data Offset : 160 sectors.
   convert that to 'K' (160 sectors of 512 bytes = 80K) and
    losetup --read-only --offset=80K /dev/loop0 /dev/sda2
 - perform some *read-only* examination of loop0.
    fsck -n /dev/loop0
    mount -o ro /dev/loop0 /mnt

   and see if there are any differences in files that have changed
   recently.

 - when finished, "umount /mnt", "losetup -d /dev/loop0" and
     mdadm /dev/md0 --re-add /dev/sda2
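
The offset conversion in the losetup step can be sketched as below. This
is only an illustration: parse_data_offset and sectors_to_kib are
hypothetical helper names (not part of mdadm), and the sketch assumes
mdadm reports "Data Offset" in 512-byte sectors in the format shown
above.

```shell
# Hypothetical helper: pull the sector count out of 'mdadm --examine'
# output read from stdin (assumes a line like "Data Offset : 160 sectors").
parse_data_offset() {
    awk '/Data Offset/ { print $4; exit }'
}

# Hypothetical helper: convert 512-byte sectors to KiB.
# Integer arithmetic, so an odd sector count would be rounded down.
sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# Example with the sample line from above (no device access needed):
sectors=$(printf '    Data Offset : 160 sectors\n' | parse_data_offset)
echo "$(sectors_to_kib "$sectors")K"    # prints 80K
```

On a real system the input would come from
"mdadm --examine /dev/sda2 | parse_data_offset", and the result is the
value to pass to losetup --offset.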

> root@tbeh:~# sync; cmp -l /dev/sdc2 /dev/sda2|wc -l
> cmp: EOF on /dev/sdc2
> 1903215
> 
> BTW, only the first few hundred bytes (at most) have a non-zero value on the SSD; the rest of the differences have 0 bytes on the SSD.
>                4233   0 347
>                4234  70  65
>                4235 232 241
>                4257   0   1

Any bytes before the "Data Offset" identified above, or after
"Data Offset" + "Used Dev Size", could easily be different.
Which bytes are different within that range?
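
One way to restrict the cmp output to that range is sketched below.
filter_data_range is a hypothetical helper name, and the sketch assumes
both "Data Offset" and "Used Dev Size" are given in 512-byte sectors
(check the units your mdadm version prints). Note that cmp -l numbers
bytes starting from 1.

```shell
# Hypothetical helper: keep only 'cmp -l' lines whose byte number lies
# inside the data area.
#   $1 = Data Offset in 512-byte sectors (assumed unit)
#   $2 = Used Dev Size in 512-byte sectors (assumed unit)
filter_data_range() {
    awk -v off="$1" -v size="$2" '
        BEGIN { lo = off * 512; hi = (off + size) * 512 }
        # cmp -l byte numbers are 1-based, so the data area is (lo, hi]
        $1 > lo && $1 <= hi
    '
}

# Example on made-up input: with an offset of 160 sectors (81920 bytes),
# only the last line is inside the data area.
printf '4233 0 347\n81920 0 1\n81921 1 2\n' | filter_data_range 160 1000
```

Against the real devices this would be used as
"cmp -l /dev/sdc2 /dev/sda2 | filter_data_range OFFSET SIZE | wc -l",
with OFFSET and SIZE taken from mdadm --examine.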

NeilBrown