Re: raid1 data corruption during resync

On Sep 2, 2014, at 5:07 PM, Eivind Sarto wrote:

> 
> On Sep 2, 2014, at 12:24 PM, Brassow Jonathan <jbrassow@xxxxxxxxxx> wrote:
> 
>> 
>> On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote:
>> 
>>> I am seeing occasional data corruption during raid1 resync.
>>> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug.
>>> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero.  It no longer does this.
>>> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it.
>>> Once I validate that the fix indeed works, I will post a proper patch.
>>> Do you have any feedback?
>>> 
>>> --- drivers/md/raid1.c	2014-08-22 15:19:15.000000000 -0700
>>> +++ /tmp/raid1.c	2014-08-29 12:07:51.000000000 -0700
>>> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf 
>>> 	 *    handling.
>>> 	 */
>>> 	wait_event_lock_irq(conf->wait_barrier,
>>> -			    !conf->array_frozen &&
>>> +			    !conf->array_frozen && !conf->nr_pending &&
>>> 			    conf->barrier < RESYNC_DEPTH &&
>>> 			    (conf->start_next_window >=
>>> 			     conf->next_resync + RESYNC_SECTORS),
>> 
>> This patch does not work - at least, it doesn't fix the issues I'm seeing.  My system hangs (in various places, like the resync thread) after commit 79ef3a8.  When testing this patch, I also added some code to dm-raid.c so that I can print out some of the variables when I encounter a problem.  After applying this patch and printing the variables, I see:
>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288
>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -465257
>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562
>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0
>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0
>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued  = 0
>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1
>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0
>> 
>> Some of those values look pretty bizarre to me and suggest the accounting is pretty messed up.
>> 
>> brassow
>> 
> 
> After reviewing commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 I notice that wait_barrier() will now only exclude writes.  User reads are not excluded even if they fall within the resync window.
> The old implementation used to exclude both reads and writes while resync-IO is active.
> Could this be a cause of data corruption?

Not sure yet.  It is also worth mentioning that a follow-on commit (8b592b6) was meant to fix problems with request counting.  I did not have that patch applied when I gathered the above info.
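
For anyone following the thread, here is a minimal userspace sketch (not kernel code) of the two things under discussion: the raise_barrier() wait condition with and without the proposed !conf->nr_pending term, and a sanity check on the request counters.  The condition is copied only from the context lines of the diff quoted above; the real function also involves locking and a wait queue, which are not modeled here.  The struct barrier_state type and all three function names are hypothetical, and the RESYNC_DEPTH and RESYNC_SECTORS values are assumptions that should be checked against drivers/md/raid1.c in your tree.

```c
#include <assert.h>   /* for the usage checks mentioned below */
#include <stdbool.h>

/* Assumed values; verify against drivers/md/raid1.c in your tree. */
#define RESYNC_DEPTH   32
#define RESYNC_SECTORS 128

/* Hypothetical userspace mirror of the r1conf fields named in this thread. */
struct barrier_state {
	int array_frozen;
	int nr_pending;
	int barrier;
	int current_window_requests;
	int next_window_requests;
	unsigned long long start_next_window;
	unsigned long long next_resync;
};

/* Wait condition as it appears in the context lines of the quoted diff. */
static bool may_raise_barrier(const struct barrier_state *s)
{
	return !s->array_frozen &&
	       s->barrier < RESYNC_DEPTH &&
	       s->start_next_window >= s->next_resync + RESYNC_SECTORS;
}

/* Same condition with the proposed !nr_pending term added. */
static bool may_raise_barrier_patched(const struct barrier_state *s)
{
	return !s->nr_pending && may_raise_barrier(s);
}

/*
 * The counters track in-flight requests, so a negative value
 * indicates that the accounting has gone wrong somewhere.
 */
static bool counters_sane(const struct barrier_state *s)
{
	return s->nr_pending >= 0 &&
	       s->current_window_requests >= 0 &&
	       s->next_window_requests >= 0;
}
```

With nr_pending = 3 and the other fields permitting resync, may_raise_barrier() returns true while may_raise_barrier_patched() returns false, which is the behavioural difference the patch is after.  Feeding in the logged current_window_requests = -465257 makes counters_sane() return false, consistent with the accounting looking broken independently of the wait condition.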

 brassow
