Re: Reliability of RAID 5 repair function (mismatch_cnt 9560824)

On 21/02/18 21:09, Michael Metze wrote:
> Hello there,
> 
> I have been running a RAID 5 consisting of 4 Seagate 4TB NAS drives
> (ST4000VN000) for 4 years now. The raid device is "scrubbed" every month
> using the "check" function. There has never been a problem. The
> filesystem is a journaled ext4.
> 
> Last week I added another external backup drive, and after a reboot, I
> was missing disk 4 (sdd) of the RAID. It was physically turned on, no
> error in the logs, but md0 was degraded. SMART data are fine. I added it
> back manually, and since I use a bitmap, it was accepted immediately. I
> ran a "check" (scrub) afterwards, which went fine.
> 
This backup drive is nothing to do with the raid, I presume? Is it on
USB? Because that can cause problems for raid. Whatever the case, if
it's not part of the raid then copying TO it should not cause any
problems.
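
If you want to double-check that the backup drive is not a member of
the array, something like this will show you (the device names below
are only examples, use your own):

    cat /proc/mdstat           # lists every md array and its member devices
    mdadm --detail /dev/md0    # member list, array state, whether it's degraded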

> Anyway, after some heavy copy actions on the raid, I moved about 1/3 of
> the data to the new backup drive, since I do not need it on the RAID.
> After another reboot, the mount process failed and reported that the
> filesystem was not clean. I started an fsck, but it was reporting
> massive inode errors ... so I stopped it and ran another "check" on the
> RAID, which gave me a mismatch_cnt of 9560824, which seems quite high.

If you've never had any errors before, that really is a lot!
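
For reference, the "check" and the counter you are quoting live in
sysfs; roughly this, with md0 taken from your description:

    echo check > /sys/block/md0/md/sync_action   # start a scrub (read and compare only)
    cat /sys/block/md0/md/sync_action            # shows "check" while running, "idle" when done
    cat /sys/block/md0/md/mismatch_cnt           # sectors that did not match on the last check/repair

(Writing "repair" instead of "check" is what actually rewrites things,
which is exactly what you do not want to do yet - see below.)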
> 
> Right now I can mount the filesystem read-only, but two important
> directories, which I hadn't touched for almost 2 years, are gone. I
> cannot explain what went wrong.
> 
> I read and understood
> https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives
> "With a raid-5 array the only thing that can be done when there is an
> error is to correct the parity. This is also the most likely error - the
> scenario where the data has been flushed and the parity not updated is
> the expected cause of problems like this."
> 
> Is there any way to detect which drive has a problem? Of course I
> suspect drive 4. How reliable is the repair function of mdadm? I want to
> make sure the RAID integrity is OK before I try to recover data from
> the filesystem, which is probably quite a big next step. Otherwise I may
> consider trying a repair with only drives 1-3 assembled in the RAID.

Okay. Run a SMART test on all the drives, especially drive 4.
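
Roughly like this - the device names are guesses based on your
description (sdd being the one that dropped out), so adjust to suit:

    smartctl -t long /dev/sdd      # start a long self-test; repeat for sda, sdb, sdc
    smartctl -l selftest /dev/sdd  # read the result once the test has finished
    smartctl -a /dev/sdd           # full attributes - watch reallocated and pending sectors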

If you suspect a failed drive, then *DO* *NOT* run a repair, because
this is not the normal "corrupt parity" problem. Parity is scattered
across all the drives, so a mismatch count this high means a lot of the
mismatches are in *data*, not just parity, and a repair will trash that
data forever.
> 
> Many many thanks for any hints in understanding the situation.
> Michael
> 
Okay, take drive 4 out, do a force-assemble of the other three, and try
a check-only fsck. If that says everything is okay, then you know drive
4 is a dud.
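
Untested sketch of what I mean - the member names below are pure
guesses (whole disks vs partitions, letters), so substitute whatever
your three good members really are:

    mdadm --stop /dev/md0
    mdadm --assemble --force --run /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1   # degraded, without sdd
    fsck.ext4 -n /dev/md0    # -n = report only, change nothing

The -n is the important bit: you want a report at this stage, not a
repair.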

I'll leave you with that for the moment - come back with the results of
the SMART and the three-drive fsck.

In the meantime, think seriously about going raid-6. You've backed up
1/3 of your 12TB - does that mean you could reshape your array as an 8TB
raid-6? Or could you add a fifth drive for a 12TB raid-6?
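
For when (and only when) the array is healthy again, a five-drive
raid-6 conversion is roughly this - the fifth drive's name is made up
here, and the reshape will take a long time:

    mdadm --add /dev/md0 /dev/sde                        # add the new drive as a spare
    mdadm --grow /dev/md0 --level=6 --raid-devices=5     # reshape raid-5 -> raid-6

Depending on the mdadm version it may insist on a --backup-file on a
disk outside the array for the reshape.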

Cheers,
Wol
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


