On Sat, 28 Jun 2014 18:43:00 -0500 Bill <billstuff2001@xxxxxxxxxxxxx> wrote:

> On 06/22/2014 08:36 PM, NeilBrown wrote:
> > On Sat, 21 Jun 2014 00:31:39 -0500 Bill<billstuff2001@xxxxxxxxxxxxx> wrote:
> >
> >> Hi Neil,
> >>
> >> I'm running a test on 3.14.8 and seeing data corruption after a recovery.
> >> I have this array:
> >>
> >>     md5 : active raid5 sdc1[2] sdb1[1] sda1[0] sde1[4] sdd1[3]
> >>           16777216 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
> >>           bitmap: 0/1 pages [0KB], 2048KB chunk
> >>
> >> with an xfs filesystem on it:
> >>     /dev/md5 on /hdtv/data5 type xfs
> >>     (rw,noatime,barrier,swalloc,allocsize=256m,logbsize=256k,largeio)
> >>
> >> and I do this in a loop:
> >>
> >> 1. start writing 1/4 GB files to the filesystem
> >> 2. fail a disk. wait a bit
> >> 3. remove it. wait a bit
> >> 4. add the disk back into the array
> >> 5. wait for the array to sync and the file writes to finish
> >> 6. checksum the files.
> >> 7. wait a bit and do it all again
> >>
> >> The checksum QC will eventually fail, usually after a few hours.
> >>
> >> My last test failed after 4 hours:
> >>
> >>     18:51:48 - mdadm /dev/md5 -f /dev/sdc1
> >>     18:51:58 - mdadm /dev/md5 -r /dev/sdc1
> >>     18:52:06 - start writing 3 files
> >>     18:52:08 - mdadm /dev/md5 -a /dev/sdc1
> >>     18:52:18 - array recovery done
> >>     18:52:23 - writes finished. QC failed for one of three files.
> >>
> >> dmesg shows no errors and the disks are operating normally.
> >>
> >> If I "check" /dev/md5 it shows mismatch_cnt = 896
> >> If I dump the raw data on sd[abcde]1 underneath the bad file, it shows
> >> sd[abde]1 are correct, and sdc1 has some chunks of old data from a
> >> previous file.
> >>
> >> If I fail sdc1, --zero-superblock it, and add it, it then syncs and the
> >> QC is correct.
> >>
> >> So somehow it seems like md is losing track of some changes which need to be
> >> written to sdc1 in the recovery. But rarely - in this case it failed
> >> after 175 cycles.
> >>
> >> Do you have any idea what could be happening here?
> > No. As you say, it looks like md is not setting a bit in the bitmap
> > correctly, or ignoring one that is set, or maybe clearing one that shouldn't
> > be cleared.
> > The last is most likely I would guess.
>
> Neil,
>
> I'm still digging through this but I found something that might help
> narrow it down - the bitmap stays dirty after the re-add and recovery
> is complete:
>
>         Filename : /dev/sde1
>            Magic : 6d746962
>          Version : 4
>             UUID : 609846f8:ad08275f:824b3cb4:2e180e57
>           Events : 5259
>   Events Cleared : 5259
>            State : OK
>        Chunksize : 2 MB
>           Daemon : 5s flush period
>       Write Mode : Normal
>        Sync Size : 4194304 (4.00 GiB 4.29 GB)
>           Bitmap : 2048 bits (chunks), 2 dirty (0.1%)
>                                        ^^^^^^^^^^^^^^
>
> This is after 1/2 hour idle. sde1 was the one removed / re-added, but
> all five disks show the same bitmap info, and the event count matches
> that of the array (5259). At this point the QC check fails.
>
> Then I manually failed, removed and re-added /dev/sde1, and shortly the
> array synced the dirty chunks:
>
>         Filename : /dev/sde1
>            Magic : 6d746962
>          Version : 4
>             UUID : 609846f8:ad08275f:824b3cb4:2e180e57
>           Events : 5275
>   Events Cleared : 5259
>            State : OK
>        Chunksize : 2 MB
>           Daemon : 5s flush period
>       Write Mode : Normal
>        Sync Size : 4194304 (4.00 GiB 4.29 GB)
>           Bitmap : 2048 bits (chunks), 0 dirty (0.0%)
>                                        ^^^^^^^^^^^^^^
>
> Now the QC check succeeds and an array "check" shows no mismatches.
>
> So it seems like md is ignoring a set bit in the bitmap, which then gets
> noticed with the fail / remove / re-add sequence.

Thanks, that helps a lot ... maybe.

I have a theory. This patch explains it and should fix it.
I'm not sure this is the patch I will go with if it works, but it will help
confirm my theory.
Can you test it?

thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 34846856dbc6..27387a3740c8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7906,6 +7906,15 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 			clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
 			set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
+			/* If there is a bitmap, we need to make sure
+			 * all writes that started before we added a spare
+			 * complete before we start doing a recovery.
+			 * Otherwise the write might complete and set
+			 * a bit in the bitmap after the recovery has
+			 * checked that bit and skipped that region.
+			 */
+			mddev->pers->quiesce(mddev, 1);
+			mddev->pers->quiesce(mddev, 0);
 		} else if (mddev->recovery_cp < MaxSector) {
 			set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
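
To make the race described in the patch comment a bit more concrete, here
is a toy user-space model in C. It is only a sketch under stated
assumptions and is not kernel code: the struct, the function names
(recover_chunk, complete_inflight_write, stale_chunks_after_recovery), the
eight-chunk bitmap and the choice of chunk 3 are all made up for
illustration. The "quiesce" step is modelled simply as letting the
in-flight write finish before the recovery scan begins.

```c
#include <assert.h>
#include <stdbool.h>

#define NCHUNKS 8            /* hypothetical bitmap size for the model */
#define INFLIGHT_CHUNK 3     /* hypothetical chunk hit by a late write */

/* Toy state: one dirty bit per chunk, plus whether the re-added
 * device still holds old data for that chunk. */
struct model {
	bool dirty[NCHUNKS];
	bool stale[NCHUNKS];
};

/* Recovery looks at one chunk and resyncs it only if its bit is set. */
static void recover_chunk(struct model *m, int c)
{
	assert(c >= 0 && c < NCHUNKS);
	if (m->dirty[c]) {
		m->stale[c] = false;   /* chunk rewritten on the re-added device */
		m->dirty[c] = false;
	}
}

/* A write issued before the device was re-added completes: it never
 * reached that device, so the chunk is stale there and the bit is set. */
static void complete_inflight_write(struct model *m, int c)
{
	assert(c >= 0 && c < NCHUNKS);
	m->stale[c] = true;
	m->dirty[c] = true;
}

/* Run one recovery pass and count chunks left stale afterwards.
 * quiesce_first == true drains the in-flight write before scanning;
 * false lets it land just after recovery has checked its bit. */
int stale_chunks_after_recovery(bool quiesce_first)
{
	struct model m = {0};
	int stale = 0;

	if (quiesce_first)
		complete_inflight_write(&m, INFLIGHT_CHUNK);

	for (int c = 0; c < NCHUNKS; c++) {
		recover_chunk(&m, c);
		if (!quiesce_first && c == INFLIGHT_CHUNK)
			complete_inflight_write(&m, INFLIGHT_CHUNK);
	}

	for (int c = 0; c < NCHUNKS; c++)
		stale += m.stale[c];
	return stale;
}
```

In this model stale_chunks_after_recovery(false) returns 1: the chunk keeps
old data and its bit is left dirty after recovery finishes, which resembles
the leftover dirty bits in the bitmap dump above. With quiesce_first set to
true it returns 0, because the bit is already set when recovery reaches it.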