On Sep 2, 2014, at 6:55 PM, NeilBrown wrote:

> On Tue, 2 Sep 2014 15:07:26 -0700 Eivind Sarto <eivindsarto@xxxxxxxxx> wrote:
>
>> On Sep 2, 2014, at 12:24 PM, Brassow Jonathan <jbrassow@xxxxxxxxxx> wrote:
>>
>>> On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote:
>>>
>>>> I am seeing occasional data corruption during raid1 resync.
>>>> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug.
>>>> Prior to this commit, raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this.
>>>> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it.
>>>> Once I validate that the fix indeed works, I will post a proper patch.
>>>> Do you have any feedback?
>>>>
>>>> --- drivers/md/raid1.c	2014-08-22 15:19:15.000000000 -0700
>>>> +++ /tmp/raid1.c	2014-08-29 12:07:51.000000000 -0700
>>>> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf
>>>>  	 * handling.
>>>>  	 */
>>>>  	wait_event_lock_irq(conf->wait_barrier,
>>>> -			    !conf->array_frozen &&
>>>> +			    !conf->array_frozen && !conf->nr_pending &&
>>>>  			    conf->barrier < RESYNC_DEPTH &&
>>>>  			    (conf->start_next_window >=
>>>>  			     conf->next_resync + RESYNC_SECTORS),
>>>
>>> This patch does not work; at least, it doesn't fix the issues I'm seeing. My system hangs (in various places, like the resync thread) after commit 79ef3a8. When testing this patch, I also added some code to dm-raid.c to allow me to print out some of the variables when I encounter a problem.
>>> After applying this patch and printing the variables, I see:
>>>
>>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288
>>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -465257
>>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562
>>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0
>>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0
>>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued = 0
>>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1
>>> Sep  2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0
>>>
>>> Some of those values look pretty bizarre to me and suggest the accounting is pretty messed up.
>>>
>>> brassow
>>
>> After reviewing commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761, I notice that wait_barrier() will now only exclude writes. User reads are not excluded even if they fall within the resync window.
>> The old implementation used to exclude both reads and writes while resync I/O is active.
>> Could this be a cause of data corruption?
>
> Could be.
> From read_balance:
>
> 	if (conf->mddev->recovery_cp < MaxSector &&
> 	    (this_sector + sectors >= conf->next_resync))
> 		choose_first = 1;
> 	else
> 		choose_first = 0;
>
> This used to be safe because a read immediately before next_resync would wait
> until all resync requests completed. But now that read requests don't block,
> it isn't safe.
> Probably best to make this:
>
> 	choose_first = (conf->mddev->recovery_cp < this_sector + sectors);
>
> Can you test that?

Unfortunately, I doubt very much that it will solve the issue I am seeing. (Although, I'd welcome this fix also.) I have predominantly been testing RHEL kernels, but I've seen this with the latest upstream kernel as well. When doing fault testing, like failing a drive while the resync is happening, the processes can become hung.
Here's one example, but they vary:

INFO: task mdX_resync:11334 blocked for more than 120 seconds.
      Not tainted 2.6.32-497.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdX_resync    D 0000000000000000     0 11334      2 0x00000080
 ffff8801fada5be0 0000000000000046 ffff88021a3bcc00 0000000300000001
 ffff880218b92200 ffff880218668f50 0000000000000282 0000000000000003
 0000000000000001 0000000000000282 ffff88021bfb45f8 ffff8801fada5fd8
Call Trace:
 [<ffffffff8140e439>] ? md_wakeup_thread+0x39/0x70
 [<ffffffffa02b81dd>] sync_request+0x26d/0xb90 [raid1]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffff810577e9>] ? __wake_up_common+0x59/0x90
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8105bd23>] ? __wake_up+0x53/0x70
 [<ffffffff81414a47>] md_do_sync+0x6c7/0xd20
 [<ffffffff814154f5>] md_thread+0x115/0x150
 [<ffffffff814153e0>] ? md_thread+0x0/0x150
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

and...

INFO: task kjournald:10732 blocked for more than 120 seconds.
      Not tainted 2.6.32-497.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D 0000000000000008     0 10732      2 0x00000080
 ffff8801d71238b0 0000000000000046 0000000000000000 0000000300000001
 ffff880218b92200 ffff880218668f50 000003783fd7424b 0000000000000003
 0000000000000001 000000010035a163 ffff880219135af8 ffff8801d7123fd8
Call Trace:
 [<ffffffffa02b463f>] wait_barrier+0xcf/0x280 [raid1]
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa02b71bf>] make_request+0x8f/0xe40 [raid1]
 [<ffffffff81126af5>] ? mempool_alloc_slab+0x15/0x20
 [<ffffffff81126c93>] ? mempool_alloc+0x63/0x140
 [<ffffffff811c9d6b>] ? bio_alloc_bioset+0x5b/0xf0
 [<ffffffffa02cf018>] raid_map+0x18/0x20 [dm_raid]

Bisecting has revealed that the commit that is the subject of this thread is the root of the problem.
It does take a long time to reproduce the issue with the scripts that I have, but if there is info I can get you, let me know.

brassow

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html