Re: raid1 data corruption during resync

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sep 2, 2014, at 7:10 AM, Brassow Jonathan <jbrassow@xxxxxxxxxx> wrote:

> There is absolutely an issue with the mentioned commit.
> 
> We are seeing symptoms of a different sort.  Our testing group is doing device fault testing.  (They are using LVM to create the RAID devices, but as you know, that uses the same underlying kernel code.)  We encounter situations where the resync thread is stuck waiting for a barrier to come down, but it never does.  All I/O to the RAID1 device blocks.
> 
> There are a few different variations to how the problem manifests itself, but I have bisected the kernel to this commit (79ef3a8).
> 
> Unfortunately, it does take quite some time (hours) to reproduce the issue with our test scripts.  I will try testing this proposed patch while I try to figure out what the patch is doing and what might have gone wrong.
> 
> brassow
> 
> 
> On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote:
> 
>> I am seeing occasional data corruption during raid1 resync.
>> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug.
>> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero.  It no longer does this.
>> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it.
>> Once I validate that the fix indeed works, I will post a proper patch.
>> Do you have any feedback?
>> 
>> — drivers/md/raid1.c	2014-08-22 15:19:15.000000000 -0700
>> +++ /tmp/raid1.c	2014-08-29 12:07:51.000000000 -0700
>> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf 
>> 	 *    handling.
>> 	 */
>> 	wait_event_lock_irq(conf->wait_barrier,
>> -			    !conf->array_frozen &&
>> +			    !conf->array_frozen && !conf->nr_pending &&
>> 			    conf->barrier < RESYNC_DEPTH &&
>> 			    (conf->start_next_window >=
>> 			     conf->next_resync + RESYNC_SECTORS),
>> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Yes, you are correct.  I think above patch would just return the resync to the old implementation.
No resync-window.  Just either one of resync-IO or user-IO at any given time for the entire array.
As I said, the comment in the resync code no longer matches the actual implementation.

-eivind--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux