Re: Problem with disk replacement

Hi Neil,

There is a WARN_ON in the patch "md/raid5: fix interaction of 'replace' and 'recovery'" that can be triggered by the following test.
Is this a problem?

Here are the steps to reproduce the WARNING messages:
1. Create a 3-drive RAID5 with --force:
   mdadm -C /dev/md1 --force -l 5 -n 3 /dev/sd[b-d]

2. While the array is resyncing, replace one disk with the following commands:
   mdadm /dev/md1 -a /dev/sde
   mdadm /dev/md1 --replace /dev/sdb
   echo "idle" >  /sys/block/md1/md/sync_action

After a while the raid5 module prints many WARNING messages from the WARN_ON introduced by the patch "md/raid5: fix interaction of 'replace' and 'recovery'". The code context is:

	if ((s.replacing || s.syncing) && s.locked == 0
	    && !test_bit(STRIPE_COMPUTE_RUN, &sh->state)
	    && !test_bit(STRIPE_REPLACED, &sh->state)) {
		/* Write out to replacement devices where possible */
		for (i = 0; i < conf->raid_disks; i++)
			if (test_bit(R5_NeedReplace, &sh->dev[i].flags)) {
				WARN_ON(!test_bit(R5_UPTODATE, &sh->dev[i].flags));
				set_bit(R5_WantReplace, &sh->dev[i].flags);
				set_bit(R5_LOCKED, &sh->dev[i].flags);
				s.locked++;
			}

The WARNING messages:
[ 1667.432071] ------------[ cut here ]------------
[ 1667.433522] WARNING: at drivers/md/raid5.c:3611 handle_stripe+0x1175/0x1e15()
[ 1667.435566] Hardware name: VirtualBox
[ 1667.438395] Modules linked in: iscsi_scst(O) scst_vdisk(O) scst(O) device_buffer(O) mem_mgnt(PO) fuse
[ 1667.441955] Pid: 4670, comm: md1_raid5 Tainted: P        W  O 3.9.6 #1
[ 1667.443551] Call Trace:
[ 1667.445093]  [<ffffffff8103f0ea>] warn_slowpath_common+0x7a/0xc0
[ 1667.445093]  [<ffffffff8103f145>] warn_slowpath_null+0x15/0x20
[ 1667.447816]  [<ffffffff815d8479>] handle_stripe+0x1175/0x1e15
[ 1667.449638]  [<ffffffff812fe5da>] ? rb_erase_init+0x23/0x30
[ 1667.451527]  [<ffffffff81126572>] ? kmem_cache_alloc+0x62/0x140
[ 1667.453506]  [<ffffffff812e8b52>] ? sg_set_page+0x29/0x3f
[ 1667.455304]  [<ffffffff812e9445>] ? __blk_segment_map_sg+0x209/0x22c
[ 1667.457224]  [<ffffffff812e96fa>] ? blk_rq_map_sg+0x292/0x29e
[ 1667.459684]  [<ffffffff814aa2c2>] ? ahci_qc_prep+0x132/0x180
[ 1667.461397]  [<ffffffff81859699>] ? _raw_spin_unlock_irqrestore+0x9/0x10
[ 1667.462912]  [<ffffffff815e13b4>] ? test_ti_thread_flag+0x24/0x26
[ 1667.464560]  [<ffffffff815e1952>] ? test_tsk_thread_flag+0x24/0x26
[ 1667.466207]  [<ffffffff815e1971>] ? signal_pending+0x1d/0x27
[ 1667.468156]  [<ffffffff815fba4e>] ? md_check_recovery+0x60/0xcc3
[ 1667.470134]  [<ffffffff81455292>] ? put_device+0x12/0x20
[ 1667.471449]  [<ffffffff81472c2a>] ? scsi_request_fn+0x9a/0x4a0
[ 1667.473183]  [<ffffffff812fd3f9>] ? blk_rq_sectors+0x18/0x1d
[ 1667.475422]  [<ffffffff81077ebd>] ? dequeue_task_fair+0x40d/0x920
[ 1667.477118]  [<ffffffff810744b0>] ? set_next_entity+0xa0/0xc0
[ 1667.477118]  [<ffffffff810754b3>] ? pick_next_task_fair+0x63/0x140
[ 1667.480396]  [<ffffffff815c9b0e>] ? list_del_init+0x24/0x26
[ 1667.483181]  [<ffffffff815f7907>] ? md_revalidate+0x30/0x30
[ 1667.484935]  [<ffffffff815dc76c>] handle_active_stripes+0x7f/0xeb
[ 1667.486274]  [<ffffffff815dc93c>] raid5d+0x164/0x20a
[ 1667.488076]  [<ffffffff815f7b9b>] md_thread+0x294/0x2b3
[ 1667.488076]  [<ffffffff81062750>] ? add_wait_queue+0x60/0x60
[ 1667.491532]  [<ffffffff81061ebb>] kthread+0xbb/0xc0
[ 1667.493266]  [<ffffffff81061e00>] ? flush_kthread_work+0x120/0x120
[ 1667.494953]  [<ffffffff8186122c>] ret_from_fork+0x7c/0xb0
[ 1667.496337]  [<ffffffff81061e00>] ? flush_kthread_work+0x120/0x120
[ 1667.498159] ---[ end trace 6c53bec353931415 ]---

Regards,
Qin Dehua

At 2013-07-25 15:32:53, NeilBrown <neilb@xxxxxxx> wrote:
>On Thu, 25 Jul 2013 16:45:31 +1000 NeilBrown <neilb@xxxxxxx> wrote:
>
>Hi again.
>
>Here is the patch which I think is correct and I hope to submit tomorrow.
>
>NeilBrown
>
>
>From f94c0b6658c7edea8bc19d13be321e3860a3fa54 Mon Sep 17 00:00:00 2001
>From: NeilBrown <neilb@xxxxxxx>
>Date: Mon, 22 Jul 2013 12:57:21 +1000
>Subject: [PATCH] md/raid5: fix interaction of 'replace' and 'recovery'.
>
>If a device in a RAID4/5/6 is being replaced while another is being
>recovered, then the writes to the replacement device currently don't
>happen, resulting in corruption when the replacement completes and the
>new drive takes over.
>
>This is because the replacement writes are only triggered when
>'s.replacing' is set and not when the similar 's.sync' is set (which
>is the case during resync and recovery - it means all devices need to
>be read).
>
>So schedule those writes when s.replacing is set as well.
>
>In this case we cannot use "STRIPE_INSYNC" to record that the
>replacement has happened as that is needed for recording that any
>parity calculation is complete.  So introduce STRIPE_REPLACED to
>record if the replacement has happened.
>
>For safety we should also check that STRIPE_COMPUTE_RUN is not set.
>This has a similar effect to the "s.locked == 0" test.  The latter
>ensures that no IO has been flagged but not yet started.  The former
>checks that no parity calculation has been flagged but not yet started.
>We must wait for both of these to complete before triggering the
>'replace'.
>
>Add a similar test to the subsequent check for "are we finished yet".
>This possibly isn't needed (it is subsumed in the STRIPE_INSYNC test),
>but it makes it more obvious that the REPLACE will happen before we
>think we are finished.
>
>Finally if a NeedReplace device is not UPTODATE then that is an
>error.  We really must trigger a warning.
>
>This bug was introduced in commit 9a3e1101b827a59ac9036a672f5fa8d5279d0fe2
>(md/raid5:  detect and handle replacements during recovery.)
>which introduced replacement for raid5.
>That was in 3.3-rc3, so any stable kernel since then would benefit
>from this fix.
>
>Cc: stable@xxxxxxxxxxxxxxx (3.3+)
>Reported-by: qindehua <13691222965@xxxxxxx>
>Tested-by: qindehua <qindehua@xxxxxxx>
>Signed-off-by: NeilBrown <neilb@xxxxxxx>
>
>diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>index 2bf094a..78ea443 100644
>--- a/drivers/md/raid5.c
>+++ b/drivers/md/raid5.c
>@@ -3462,6 +3462,7 @@ static void handle_stripe(struct stripe_head *sh)
> 		    test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
> 			set_bit(STRIPE_SYNCING, &sh->state);
> 			clear_bit(STRIPE_INSYNC, &sh->state);
>+			clear_bit(STRIPE_REPLACED, &sh->state);
> 		}
> 		spin_unlock(&sh->stripe_lock);
> 	}
>@@ -3607,19 +3608,23 @@ static void handle_stripe(struct stripe_head *sh)
> 			handle_parity_checks5(conf, sh, &s, disks);
> 	}
> 
>-	if (s.replacing && s.locked == 0
>-	    && !test_bit(STRIPE_INSYNC, &sh->state)) {
>+	if ((s.replacing || s.syncing) && s.locked == 0
>+	    && !test_bit(STRIPE_COMPUTE_RUN, &sh->state)
>+	    && !test_bit(STRIPE_REPLACED, &sh->state)) {
> 		/* Write out to replacement devices where possible */
> 		for (i = 0; i < conf->raid_disks; i++)
>-			if (test_bit(R5_UPTODATE, &sh->dev[i].flags) &&
>-			    test_bit(R5_NeedReplace, &sh->dev[i].flags)) {
>+			if (test_bit(R5_NeedReplace, &sh->dev[i].flags)) {
>+				WARN_ON(!test_bit(R5_UPTODATE, &sh->dev[i].flags));
> 				set_bit(R5_WantReplace, &sh->dev[i].flags);
> 				set_bit(R5_LOCKED, &sh->dev[i].flags);
> 				s.locked++;
> 			}
>-		set_bit(STRIPE_INSYNC, &sh->state);
>+		if (s.replacing)
>+			set_bit(STRIPE_INSYNC, &sh->state);
>+		set_bit(STRIPE_REPLACED, &sh->state);
> 	}
> 	if ((s.syncing || s.replacing) && s.locked == 0 &&
>+	    !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
> 	    test_bit(STRIPE_INSYNC, &sh->state)) {
> 		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
> 		clear_bit(STRIPE_SYNCING, &sh->state);
>diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>index b0b663b..70c4932 100644
>--- a/drivers/md/raid5.h
>+++ b/drivers/md/raid5.h
>@@ -306,6 +306,7 @@ enum {
> 	STRIPE_SYNC_REQUESTED,
> 	STRIPE_SYNCING,
> 	STRIPE_INSYNC,
>+	STRIPE_REPLACED,
> 	STRIPE_PREREAD_ACTIVE,
> 	STRIPE_DELAYED,
> 	STRIPE_DEGRADED,



