Re: raid1 data corruption during resync

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sep 2, 2014, at 12:24 PM, Brassow Jonathan <jbrassow@xxxxxxxxxx> wrote:

> 
> On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote:
> 
>> I am seeing occasional data corruption during raid1 resync.
>> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug.
>> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero.  It no longer does this.
>> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it.
>> Once I validate that the fix indeed works, I will post a proper patch.
>> Do you have any feedback?
>> 
>> — drivers/md/raid1.c	2014-08-22 15:19:15.000000000 -0700
>> +++ /tmp/raid1.c	2014-08-29 12:07:51.000000000 -0700
>> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf 
>> 	 *    handling.
>> 	 */
>> 	wait_event_lock_irq(conf->wait_barrier,
>> -			    !conf->array_frozen &&
>> +			    !conf->array_frozen && !conf->nr_pending &&
>> 			    conf->barrier < RESYNC_DEPTH &&
>> 			    (conf->start_next_window >=
>> 			     conf->next_resync + RESYNC_SECTORS),
> 
> This patch does not work - at least, it doesn't fix the issues I'm seeing.  My system hangs (in various places, like the resync thread) after commit 79ef3a8.  When testing this patch, I also added some code to dm-raid.c to allow me to print-out some of the variables when I encounter a problem.  After applying this patch and printing the variables, I see:
> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288
> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -46
> 5257
> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562
> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0
> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0
> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued  = 0
> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1
> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0
> 
> Some of those values look pretty bizarre to me and suggest the accounting is pretty messed up.
> 
> brassow
> 

After reviewing commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 I notice that wait_barrier() will now only exclude writes.  User reads are not excluded even if the fall within the resync window.
The old implementation used to exclude both reads and writes while resync-IO is active.
Could this be a cause of data corruption?

-eivind--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux