Re: raid1 data corruption during resync

Brassow Jonathan <jbrassow@xxxxxxxxxx> · Tue, 2 Sep 2014 14:24:08 -0500

On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote:

> I am seeing occasional data corruption during raid1 resync.
> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug.
> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero.  It no longer does this.
> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it.
> Once I validate that the fix indeed works, I will post a proper patch.
> Do you have any feedback?
> 
> --- drivers/md/raid1.c	2014-08-22 15:19:15.000000000 -0700
> +++ /tmp/raid1.c	2014-08-29 12:07:51.000000000 -0700
> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf 
> 	 *    handling.
> 	 */
> 	wait_event_lock_irq(conf->wait_barrier,
> -			    !conf->array_frozen &&
> +			    !conf->array_frozen && !conf->nr_pending &&
> 			    conf->barrier < RESYNC_DEPTH &&
> 			    (conf->start_next_window >=
> 			     conf->next_resync + RESYNC_SECTORS),

This patch does not work; at least, it doesn't fix the issues I'm seeing.  My system hangs (in various places, such as the resync thread) after commit 79ef3a8.  While testing this patch, I also added some code to dm-raid.c to print out some of the variables when I hit a problem.  With the patch applied, the printed values are:
Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288
Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -465257
Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562
Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0
Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0
Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued  = 0
Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1
Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0

Some of those values look bizarre to me: current_window_requests and next_window_requests are both negative, which suggests the request accounting is badly off.
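
To illustrate why a negative request counter points at an accounting bug rather than at a missing wait condition, here is a minimal userspace sketch.  It is not the raid1.c code: the struct and all function names are hypothetical, and only the field names are borrowed from the output above.  It assumes each request increments a counter when it starts and decrements the same counter when it completes, so a value below zero can only come from a decrement with no matching increment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; field names mirror the printed variables. */
struct window_counters {
	int current_window_requests;
	int next_window_requests;
	int nr_pending;
};

/* An in-flight request count can never legitimately be negative. */
static bool counters_sane(const struct window_counters *c)
{
	return c->current_window_requests >= 0 &&
	       c->next_window_requests >= 0 &&
	       c->nr_pending >= 0;
}

/* Account for a request entering the current window. */
static void request_start(struct window_counters *c)
{
	c->current_window_requests++;
	c->nr_pending++;
}

/* Account for that request completing. */
static void request_end(struct window_counters *c)
{
	c->current_window_requests--;
	c->nr_pending--;
}

/* One balanced request, then one stray decrement. */
static bool stray_decrement_detected(void)
{
	struct window_counters c = {0};

	request_start(&c);
	request_end(&c);
	assert(counters_sane(&c));	/* balanced start/end leaves zeros */

	c.current_window_requests--;	/* completion with no matching start */
	return !counters_sane(&c);
}
```

In this model nr_pending can read 0 while a window counter is already negative, which is the same shape as the values printed above.  Whether the kernel reaches that state the same way is an open question, not something this sketch shows.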

 brassow
