RE: raid6 - data integrity issue - data mis-compare on rebuilding RAID 6 - with 100 Mb resync speed.

>Can you share your exact test scripts?  I'm having a hard time reproducing this with something like:

>echo 100000 > /proc/sys/dev/raid/speed_limit_min
>mdadm --add /dev/md0 /dev/sd[bc]; dd if=urandom.dump of=/dev/md0 bs=1024M oflag=sync
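
A plain dd like that needs a read-back compare afterwards to catch the mis-compare. A minimal check, assuming urandom.dump is still on disk and was written out to the array in full, would be something like:

SIZE=$(stat -c %s urandom.dump)
cmp -n "$SIZE" /dev/md0 urandom.dump && echo "no mis-compare" || echo "MIS-COMPARE detected"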

I have attached the script, which simulates a two-drive failure in RAID 6. For running I/O we used Dit32, but you can use any data-verification tool that writes a known pattern, reads the data back, and verifies it.
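
For reference, the rough shape of the test is sketched below. The device names, pattern file and sizes are placeholders only; the attached RollingHotSpareTwoDriveFailure.sh is the script we actually ran.

#!/bin/bash
# Sketch: make the RAID 6 double-degraded, start recovery, write a known
# pattern while the rebuild is running, then read it back and compare.
MD=/dev/md0
PATTERN=/root/pattern.bin   # known-pattern file (placeholder path)
SIZE_MB=4096                # amount of data to write/verify (placeholder)

# keep resync/recovery running at full speed, as in the report
echo 100000 > /proc/sys/dev/raid/speed_limit_min

# generate the known pattern once
dd if=/dev/urandom of=$PATTERN bs=1M count=$SIZE_MB

# fail and remove two member drives to make the array double-degraded
mdadm $MD --fail /dev/sdb --remove /dev/sdb
mdadm $MD --fail /dev/sdc --remove /dev/sdc

# re-add them so recovery starts, then write the pattern during the rebuild
mdadm $MD --add /dev/sdb /dev/sdc
dd if=$PATTERN of=$MD bs=1M count=$SIZE_MB oflag=direct

# wait for the rebuild to finish, then read back and compare
mdadm --wait $MD
cmp -n $((SIZE_MB * 1024 * 1024)) $MD $PATTERN && echo "data OK" || echo "MIS-COMPARE"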


>-----Original Message-----
>From: dan.j.williams@xxxxxxxxx [mailto:dan.j.williams@xxxxxxxxx] On Behalf Of Dan Williams
>Sent: Tuesday, May 20, 2014 5:52 AM
>To: NeilBrown
>Cc: Manibalan P; linux-raid
>Subject: Re: raid6 - data integrity issue - data mis-compare on rebuilding RAID 6 - with 100 Mb resync speed.

>On Fri, May 16, 2014 at 11:11 AM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> On Mon, May 5, 2014 at 12:21 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> On Wed, 23 Apr 2014 10:02:00 -0700 Dan Williams 
>> <dan.j.williams@xxxxxxxxx>
>> wrote:
>>
>>> On Wed, Apr 23, 2014 at 12:07 AM, NeilBrown <neilb@xxxxxxx> wrote:
>>> > On Fri, 11 Apr 2014 17:41:12 +0530 "Manibalan P" 
>>> > <pmanibalan@xxxxxxxxxxxxxx>
>>> > wrote:
>>> >
>>> >> Hi Neil,
>>> >>
>>> >> Also, I found the data corruption issue on RHEL 6.5.
>>> >>
>>> >> For your kind attention, I up-ported the md code [raid5.c + 
>>> >> raid5.h] from FC11 kernel to CentOS 6.4, and there is no 
>>> >> mis-compare with the up-ported code.
>>> >
>>> > This narrows it down to between 2.6.29 and 2.6.32 - is that correct?
>>> >
>>> > So it is probably the change to RAID6 to support async parity calculations.
>>> >
>>> > Looking at the code always makes my head spin.
>>> >
>>> > Dan : have you any ideas?
>>> >
>>> > It seems that writing to a double-degraded RAID6 while it is 
>>> > recovering to a spare can trigger data corruption.
>>> >
>>> > 2.6.29 works
>>> > 2.6.32 doesn't
>>> > 3.8.0 still doesn't.
>>> >
>>> > I suspect async parity calculations.
>>>
>>> I'll take a look.  I've had cleanups of that code on my backlog for 
>>> "a while now (TM)".
>>
>>
>> Hi Dan,
>>  did you have a chance to have a look?
>>
>> I've been consistently failing to find anything.
>>
>> I have a question though.
>> If we set up a chain of async dma handling via:
>>    ops_run_compute6_2 then ops_bio_drain then ops_run_reconstruct
>>
>> is it possible for the ops_complete_compute callback set up by
>> ops_run_compute6_2 to be called before ops_run_reconstruct has been 
>> scheduled or run?
>
> In the absence of a dma engine we never run asynchronously, so we will
> *always* call ops_complete_compute() before ops_run_reconstruct() in 
> the synchronous case.  This looks confused.  We're certainly leaking 
> an uptodate state prior to the completion of the write.
>
>> If so, there seems to be some room for confusion over the setting for 
>> R5_UPTODATE on blocks that are being computed and then drained to.  
>> Both will try to set the flag, so it could get set before reconstruction has run.
>>
>> I can't see that this would cause a problem, but then I'm not 
>> entirely sure why we clear R5_UPTODATE when we set R5_Wantdrain.
>
> Let me see what problems this could be causing.  I'm thinking we 
> should be protected by the global ->reconstruct_state, but something 
> is telling me we do depend on R5_UPTODATE being consistent with the 
> ongoing stripe operation.
>

Can you share your exact test scripts?  I'm having a hard time reproducing this with something like:

echo 100000 > /proc/sys/dev/raid/speed_limit_min
mdadm --add /dev/md0 /dev/sd[bc]; dd if=urandom.dump of=/dev/md0 bs=1024M oflag=sync

This is a 7-drive raid6 array.

Attachment: RollingHotSpareTwoDriveFailure.sh
Description: RollingHotSpareTwoDriveFailure.sh

