Re: raid1 out of sync, but which files are affected?

On 27/1/19 10:21 am, Nik.Brt. wrote:
On 26/01/2019 11:49, Harald Dunkel wrote:
I initiated a check of my RAID1 (2 disks) this morning. mismatch_cnt
is at 128 by now.

AFAIR 128 is a rounded number, and you are not going to get it more precise than that.
It depends on the granularity of the check, which is determined by the raid1 code.

There is no official way to do what you want.

Well, not exactly. What I did before the log messages were introduced was to run 'check'
operations in small sections. I had a script that read the whole array in 32 sections
by setting 'sync_min' and 'sync_max', then dealt with the faulty section by dividing it
again into 32 sections, and so on.

Trying to do a 'check' of the whole array in very small sections takes a very long time,
hence the divide-and-conquer approach.
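
Roughly like this (an untested sketch, not the original script; /dev/md0, the section
count, and the exact sync_completed handling are assumptions to adapt):

#!/usr/bin/env python3
# Sectioned 'check': run a check over one slice of md0 at a time and
# report which slices show mismatches.  Needs root.  Assumes that
# mismatch_cnt is reset when a new check starts; if your kernel keeps
# a running total, record the per-section delta instead.
import time

MD = "/sys/block/md0/md"
SECTIONS = 32

def write(name, value):
    with open(f"{MD}/{name}", "w") as f:
        f.write(str(value))

def read(name):
    with open(f"{MD}/{name}") as f:
        return f.read().strip()

def check_section(start, end):
    """Run 'check' over sectors [start, end) and return mismatch_cnt."""
    write("sync_min", start)
    write("sync_max", end)
    write("sync_action", "check")
    while True:
        if read("sync_action") == "idle":
            break                              # finished on its own
        done = read("sync_completed").split("/")[0].strip()
        if done not in ("", "none") and int(done) >= end:
            break                              # paused at sync_max
        time.sleep(1)
    write("sync_action", "idle")
    return int(read("mismatch_cnt"))

with open("/sys/block/md0/size") as f:
    total = int(f.read())                      # size in 512-byte sectors

step = total // SECTIONS
for i in range(SECTIONS):
    start = i * step
    end = total if i == SECTIONS - 1 else (i + 1) * step
    print(f"sectors {start}-{end}: mismatch_cnt={check_section(start, end)}")
# a section with a non-zero count is then subdivided the same way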

Another way is to log /proc/mdstat for the position and 'mismatch_cnt' for the status,
at short intervals (I used 10 seconds) until all the mismatches are found. One can then
stop the check by writing 'idle' to 'sync_action'.

This gives a good idea of the location of the bad stripe(s). One can then do a fine
'check' around the location(s) of the mismatches to identify the exact location(s).
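
A sketch of that loop (again untested; it samples sync_completed instead of parsing
/proc/mdstat, which gives the same position in sectors):

#!/usr/bin/env python3
# Start one full 'check' of md0 and sample the position and mismatch_cnt
# every 10 seconds.  When the count jumps between two samples, the
# mismatch lies in the sector range covered during that interval.
import time

MD = "/sys/block/md0/md"

def read(name):
    with open(f"{MD}/{name}") as f:
        return f.read().strip()

with open(f"{MD}/sync_action", "w") as f:
    f.write("check")                       # start a full check

last_cnt, last_pos = 0, 0
while read("sync_action") == "check":
    comp = read("sync_completed")          # e.g. "1234567 / 976754176"
    pos = last_pos if comp == "none" else int(comp.split("/")[0])
    cnt = int(read("mismatch_cnt"))
    if cnt != last_cnt:
        print(f"count {last_cnt} -> {cnt} between sectors {last_pos} and {pos}")
    last_cnt, last_pos = cnt, pos
    time.sleep(10)
# to stop early once all mismatches are seen, write 'idle' to sync_action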

HTH

You have to write a program in a general-purpose language to compare the two underlying
MD member devices sector by sector (actually 4k granularity is the smallest you can go).
Since the array can be modified while you check it, run the comparison twice and "AND"
the two results together, so as to keep only the mismatches that appeared in both passes.

You have to skip the MD metadata at the beginning and/or at the end of each device,
because it is expected to differ, and also so that the offsets are computed precisely.
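
A minimal sketch of such a program (untested; the member device names and the data
offset are placeholders, take the real "Data Offset" from 'mdadm --examine'. With
0.90/1.0 metadata the superblock sits at the end instead, so skip the tail rather
than the head):

#!/usr/bin/env python3
# Compare the two RAID1 members in 4 KiB chunks, record the offsets that
# differ, then do a second pass and keep only the offsets that mismatched
# both times, since the array may change underneath us.  Needs root.
DEV_A, DEV_B = "/dev/sda1", "/dev/sdb1"   # the two md members (placeholders)
DATA_OFFSET = 2048 * 512                  # "Data Offset" from mdadm -E, in bytes
CHUNK = 4096

def scan():
    """Return the set of byte offsets (relative to the array) that differ."""
    bad = set()
    with open(DEV_A, "rb") as a, open(DEV_B, "rb") as b:
        a.seek(DATA_OFFSET)
        b.seek(DATA_OFFSET)
        pos = 0
        while True:
            ba, bb = a.read(CHUNK), b.read(CHUNK)
            if not ba or not bb:
                break
            if ba != bb:
                bad.add(pos)
            pos += CHUNK
    return bad

# two passes, "AND"ed together
stable = scan() & scan()
for off in sorted(stable):
    print(f"mismatch at array offset {off} (sector {off // 512})")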

Then, if you want to publish your software... :-)

These mismatches do happen on raid1, but why they happen is not precisely known. There
are a few ideas (the usual suspect is a buffer being modified while its write is still
in flight, e.g. by swap or O_DIRECT writers), and they are said to be harmless in most
cases (= outside of files).
The phenomenon happens a lot less if you have LVM over the raid1, and exactly why is
not known either.
It doesn't seem to happen in raid5/6.
Can't remember about raid10.

After you have the offset, it is like you say: use debugfs.
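
On ext2/3/4 that is two debugfs steps, for example (block and inode numbers made up;
the fs block is the byte offset into the filesystem divided by the block size,
usually 4096, after subtracting any partition/LVM offset):

  debugfs -R "icheck 123456" /dev/md0    # inode that owns fs block 123456
  debugfs -R "ncheck 7890" /dev/md0      # path(s) for that inode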

If you do the check, let us know ...
good luck



--
Eyal at Home (eyal@xxxxxxxxxxxxxx)


