Re: raid5 (re)-add recovery data corruption

On Sat, 28 Jun 2014 18:43:00 -0500 Bill <billstuff2001@xxxxxxxxxxxxx> wrote:

> On 06/22/2014 08:36 PM, NeilBrown wrote:
> > On Sat, 21 Jun 2014 00:31:39 -0500 Bill <billstuff2001@xxxxxxxxxxxxx> wrote:
> >
> >> Hi Neil,
> >>
> >> I'm running a test on 3.14.8 and seeing data corruption after a recovery.
> >> I have this array:
> >>
> >>       md5 : active raid5 sdc1[2] sdb1[1] sda1[0] sde1[4] sdd1[3]
> >>             16777216 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
> >>             bitmap: 0/1 pages [0KB], 2048KB chunk
> >>
> >> with an xfs filesystem on it:
> >>       /dev/md5 on /hdtv/data5 type xfs
> >> (rw,noatime,barrier,swalloc,allocsize=256m,logbsize=256k,largeio)
> >>
> >> and I do this in a loop:
> >>
> >> 1. start writing 1/4 GB files to the filesystem
> >> 2. fail a disk. wait a bit
> >> 3. remove it. wait a bit
> >> 4. add the disk back into the array
> >> 5. wait for the array to sync and the file writes to finish
> >> 6. checksum the files.
> >> 7. wait a bit and do it all again
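> >>
> >> In shell terms one cycle is roughly this (a sketch - the write
> >> helper, the checksum file, the sleeps and the chosen disk are
> >> illustrative):
> >>
> >>     write_test_files /hdtv/data5 &   # e.g. three 256MB files
> >>     mdadm /dev/md5 -f /dev/sdc1 ; sleep 10
> >>     mdadm /dev/md5 -r /dev/sdc1 ; sleep 10
> >>     mdadm /dev/md5 -a /dev/sdc1
> >>     mdadm --wait /dev/md5            # recovery done
> >>     wait                             # writes finished
> >>     md5sum -c checksums.md5          # the QC step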
> >>
> >> The checksum QC will eventually fail, usually after a few hours.
> >>
> >> My last test failed after 4 hours:
> >>
> >>       18:51:48 - mdadm /dev/md5 -f /dev/sdc1
> >>       18:51:58 - mdadm /dev/md5 -r /dev/sdc1
> >>       18:52:06 - start writing 3 files
> >>       18:52:08 - mdadm /dev/md5 -a /dev/sdc1
> >>       18:52:18 - array recovery done
> >>       18:52:23 - writes finished. QC failed for one of three files.
> >>
> >> dmesg shows no errors and the disks are operating normally.
> >>
> >> If I "check" /dev/md5 it shows mismatch_cnt = 896
> >> If I dump the raw data on sd[abcde]1 underneath the bad file, it shows
> >> sd[abde]1 are correct, and sdc1 has some chunks of old data from a
> >> previous file.
> >>
> >> If I fail sdc1, --zero-superblock it, and add it, it then syncs and the
> >> QC is correct.
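> >>
> >> (i.e. roughly:
> >>
> >>     mdadm /dev/md5 -f /dev/sdc1
> >>     mdadm /dev/md5 -r /dev/sdc1
> >>     mdadm --zero-superblock /dev/sdc1
> >>     mdadm /dev/md5 -a /dev/sdc1
> >>
> >> which forces a full recovery rather than a bitmap-based catch-up.)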
> >>
> >> So somehow it seems like md is losing track of some changes which
> >> need to be written to sdc1 in the recovery. But rarely - in this
> >> case it failed after 175 cycles.
> >>
> >> Do you have any idea what could be happening here?
> > No.  As you say, it looks like md is not setting a bit in the bitmap
> > correctly, or ignoring one that is set, or maybe clearing one that shouldn't
> > be cleared.
> > The last is the most likely, I would guess.
> 
> Neil,
> 
> I'm still digging through this but I found something that might help
> narrow it down - the bitmap stays dirty after the re-add, even once
> the recovery is complete.
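> 
> Examining the bitmap superblock (e.g. with "mdadm -X /dev/sde1") shows: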
> 
>          Filename : /dev/sde1
>             Magic : 6d746962
>           Version : 4
>              UUID : 609846f8:ad08275f:824b3cb4:2e180e57
>            Events : 5259
>    Events Cleared : 5259
>             State : OK
>         Chunksize : 2 MB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 4194304 (4.00 GiB 4.29 GB)
>            Bitmap : 2048 bits (chunks), 2 dirty (0.1%)
>                                         ^^^^^^^^^^^^^^
> 
> This is after 1/2 hour idle. sde1 was the one removed / re-added, but
> all five disks show the same bitmap info, and the event count matches
> that of the array (5259). At this point the QC check fails.
> 
> Then I manually failed, removed and re-added /dev/sde1, and shortly
> afterwards the array synced the dirty chunks:
> 
>          Filename : /dev/sde1
>             Magic : 6d746962
>           Version : 4
>              UUID : 609846f8:ad08275f:824b3cb4:2e180e57
>            Events : 5275
>    Events Cleared : 5259
>             State : OK
>         Chunksize : 2 MB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 4194304 (4.00 GiB 4.29 GB)
>            Bitmap : 2048 bits (chunks), 0 dirty (0.0%)
>                                         ^^^^^^^^^^^^^^
> 
> Now the QC check succeeds and an array "check" shows no mismatches.
> 
> So it seems like md is ignoring a set bit in the bitmap, which then
> gets noticed with the fail / remove / re-add sequence.

Thanks, that helps a lot ... maybe.

I have a theory.  This patch explains it and should fix it.
If it works, I'm not sure this is the patch I will finally go with, but
it will help confirm my theory.
Can you test it?

thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 34846856dbc6..27387a3740c8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7906,6 +7906,15 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 			clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
 			set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
+			/* If there is a bitmap, we need to make sure
+			 * all writes that started before we added a spare
+			 * complete before we start doing a recovery.
+			 * Otherwise the write might complete and set
+			 * a bit in the bitmap after the recovery has
+			 * checked that bit and skipped that region.
+			 */
+			mddev->pers->quiesce(mddev, 1);
+			mddev->pers->quiesce(mddev, 0);
 		} else if (mddev->recovery_cp < MaxSector) {
 			set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
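
(For context: ->quiesce(mddev, 1) blocks new requests and waits for all
in-flight ones to drain, and ->quiesce(mddev, 0) resumes them, so the
pair above acts as a write barrier before the recovery scans the bitmap.)

With the patch applied, the bitmap should drop back to zero dirty chunks
shortly after each recovery completes; something like this (illustrative)
should confirm it:

    mdadm --wait /dev/md5              # recovery finished
    sleep 30                           # a few of the 5s daemon periods
    mdadm -X /dev/sde1 | grep Bitmap   # expect "0 dirty"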
