Re: raid1 data corruption during resync

Eivind Sarto <eivindsarto@xxxxxxxxx> · Tue, 2 Sep 2014 09:43:31 -0700

On Sep 2, 2014, at 7:10 AM, Brassow Jonathan <jbrassow@xxxxxxxxxx> wrote:

> There is absolutely an issue with the mentioned commit.
> 
> We are seeing symptoms of a different sort.  Our testing group is doing device fault testing.  (They are using LVM to create the RAID devices, but as you know, that uses the same underlying kernel code.)  We encounter situations where the resync thread is stuck waiting for a barrier to come down, but it never does.  All I/O to the RAID1 device blocks.
> 
> There are a few different variations to how the problem manifests itself, but I have bisected the kernel to this commit (79ef3a8).
> 
> Unfortunately, it does take quite some time (hours) to reproduce the issue with our test scripts.  I will try testing this proposed patch while I try to figure out what the patch is doing and what might have gone wrong.
> 
> brassow
> 
> 
> On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote:
> 
>> I am seeing occasional data corruption during raid1 resync.
>> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug.
>> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero.  It no longer does this.
>> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it.
>> Once I validate that the fix indeed works, I will post a proper patch.
>> Do you have any feedback?
>> 
>> --- drivers/md/raid1.c	2014-08-22 15:19:15.000000000 -0700
>> +++ /tmp/raid1.c	2014-08-29 12:07:51.000000000 -0700
>> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf 
>> 	 *    handling.
>> 	 */
>> 	wait_event_lock_irq(conf->wait_barrier,
>> -			    !conf->array_frozen &&
>> +			    !conf->array_frozen && !conf->nr_pending &&
>> 			    conf->barrier < RESYNC_DEPTH &&
>> 			    (conf->start_next_window >=
>> 			     conf->next_resync + RESYNC_SECTORS),
>> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

The suggested patch is not the right thing to do.  I was confused by the big /* Barriers.... */ comment, which suggests that conf->nr_pending is part of the barrier implementation.  That is not the case: the new barrier code no longer uses nr_pending.
nr_pending is now only used when freezing the array.
Sorry for the confusion.
I back-ported the new barrier code to an older version of Linux, and I could have introduced a bug during the back-port.  I will try to upgrade my system to a newer kernel and see if I can reproduce the corruption there.
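
To make the difference concrete, here is a rough user-space sketch of the two wait conditions, not kernel code.  The first predicate mirrors the raise_barrier() condition as it appears in the diff quoted above; the second adds the term from my withdrawn patch.  struct toy_conf, both function names, and the two constant values are assumptions made up for this illustration.

```c
#include <stdbool.h>

/* Assumed illustrative values; the real ones live in drivers/md/raid1.c. */
#define RESYNC_DEPTH   32
#define RESYNC_SECTORS 128

/* Hypothetical stand-in for the struct r1conf fields named in this thread. */
struct toy_conf {
	bool array_frozen;
	int barrier;
	int nr_pending;
	unsigned long long start_next_window;
	unsigned long long next_resync;
};

/* Wait condition of raise_barrier() as quoted in the diff (unpatched). */
static bool barrier_may_rise(const struct toy_conf *c)
{
	return !c->array_frozen &&
	       c->barrier < RESYNC_DEPTH &&
	       c->start_next_window >= c->next_resync + RESYNC_SECTORS;
}

/* Same condition plus the withdrawn "!conf->nr_pending" term. */
static bool barrier_may_rise_patched(const struct toy_conf *c)
{
	return !c->nr_pending && barrier_may_rise(c);
}
```

With nr_pending = 3, start_next_window = 1024, and everything else zero, barrier_may_rise() is true while barrier_may_rise_patched() is false.  In other words, the extra term would hold resync back whenever any I/O is pending on the array, including I/O that lies outside the resync window.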

-eivind