RE: raid1: prevent adding a "too recent" device to a mirror?

"Dailey, Nate" <Nate.Dailey@xxxxxxxxxxx> · Fri, 6 Aug 2010 13:14:25 -0400

Does the following patch look something like what you had in mind?

I originally did this in raid1 with a RHEL 5 (2.6.18) kernel because
that's where I need the fix, but the patch below is against md.c from a
RHEL 6 (2.6.32) kernel (since it's less removed from the latest
kernel.org kernel).

This seems to work correctly for me, only tested with raid1. Some notes:

- I had to return '0' in the add_new_disk case, because if I returned an
error mdadm would try again, with a "normal add" instead of a "re-add"
(meaning the disk would be added, and a full sync performed).

- I had to modify super_1_sync to completely fill dev_roles, instead of
just filling it in for those devices currently present in the array
(without this change, starting a raid1 with only 1 device could cause
the superblock to show two active devices, when in fact one device
should have been marked faulty).

Nate

--- md.c.orig	2010-08-06 08:53:23.000000000 -0400
+++ md.c	2010-08-06 12:49:53.000000000 -0400
@@ -854,6 +854,53 @@ static unsigned int calc_sb_csum(mdp_sup
 	return csum;
 }
 
+/* Check if an rdev to be added to an mddev is split, meaning it had
been
+   assembled without including devices currently in the array. In this
case,
+   the rdev's superblock will show faulty/removed devices which are not
+   currently faulty in the mddev. */
+static int is_split_rdev(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+	mdk_rdev_t *rdev2;
+
+	list_for_each_entry(rdev2, &mddev->disks, same_set) {
+
+		if (rdev2 == rdev) {
+			continue;
+		}
+
+		if (test_bit(Faulty, &rdev2->flags) || (rdev2->desc_nr <
0)) {
+			continue;
+		}
+
+		if (mddev->major_version == 0) {
+
+			mdp_super_t *sb;
+			mdp_disk_t *d;
+
+			sb = (mdp_super_t *)page_address(rdev->sb_page);
+			d = &sb->disks[rdev2->desc_nr];
+
+			if (d->state &
+			    (1<<MD_DISK_FAULTY | 1<<MD_DISK_REMOVED)) {
+				return 1;
+			}
+
+		} else if (mddev->major_version == 1) {
+
+			struct mdp_superblock_1 *sb;
+			int role;
+
+			sb = (struct
mdp_superblock_1*)page_address(rdev->sb_page);
+			role =
le16_to_cpu(sb->dev_roles[rdev2->desc_nr]);
+
+			if (role == 0xfffe) {
+				return 1;
+			}
+		}
+	}
+
+	return 0;
+}
 
 /*
  * Handle superblock details.
@@ -1592,7 +1639,10 @@ static void super_1_sync(mddev_t *mddev,
 		if (rdev->sb_size & bmask)
 			rdev->sb_size = (rdev->sb_size | bmask) + 1;
 	}
-	for (i=0; i<max_dev;i++)
+
+	/* Use sb->max_dev instead of max_dev so an absent device with
+	   a higher desc_nr is marked faulty. */
+	for (i=0; i<le32_to_cpu(sb->max_dev); i++)
 		sb->dev_roles[i] = cpu_to_le16(0xfffe);
 	
 	list_for_each_entry(rdev2, &mddev->disks, same_set) {
@@ -4338,6 +4388,15 @@ static int md_run(mddev_t *mddev)
 		analyze_sbs(mddev);
 	}
 
+	/* Refuse to start a split array. */
+	list_for_each_entry(rdev, &mddev->disks, same_set) {
+		if (is_split_rdev(mddev, rdev)) {
+			printk("md: %s: split array, manual
intervention"
+			       " required\n", mdname(mddev));
+			return -EINVAL;
+		}
+	}
+
 	if (mddev->level != LEVEL_NONE)
 		request_module("md-level-%d", mddev->level);
 	else if (mddev->clevel[0])
@@ -5081,6 +5140,16 @@ static int add_new_disk(mddev_t * mddev,
 				PTR_ERR(rdev));
 			return PTR_ERR(rdev);
 		}
+
+		/* Don't add a device that's been split from the array.
*/
+		if (is_split_rdev(mddev, rdev)) {
+			printk("md: %s: kicking %s from array:"
+			       " split array, manual intervention
required\n",
+			       mdname(mddev), bdevname(rdev->bdev,b));
+			export_rdev(rdev);
+			return 0; // so mdadm doesn't try again
+		}
+
 		/* set save_raid_disk if appropriate */
 		if (!mddev->persistent) {
 			if (info->state & (1<<MD_DISK_SYNC)  &&





-----Original Message-----
From: Neil Brown [mailto:neilb@xxxxxxx] 
Sent: Thursday, July 22, 2010 11:32 PM
To: Dailey, Nate
Cc: linux-raid@xxxxxxxxxxxxxxx
Subject: Re: raid1: prevent adding a "too recent" device to a mirror?

On Thu, 22 Jul 2010 13:40:19 -0400
"Dailey, Nate" <Nate.Dailey@xxxxxxxxxxx> wrote:

> Hi, I'm looking for some guidance... not sure if this is a bug, or
just
> something that's not supposed to work:
> 
> - I'm working with a 2-disk raid1 with an internal bitmap, might also
> apply to other personalities
> - remove one disk from the raid1
> - make some changes to the remaining disk in the raid1
> - stop the raid1
> - assemble the raid1 using only the disk that had previously been
> removed (the less-recent disk)
> - now, add the other disk (the more-recent disk)

Yes, this is a problem scenario.

I think the right way to address it is to notice that each device thinks
that
the other is failed/removed and determine that they must be completely
out of
sync with each other, and refuse to allow the both in the same array
without
completely re-initialising one of them.

Your idea of rejecting devices with event counts larger than the bitmap
knows
about might be useful too, though I would do it in bitmap.c (which you
cannot
do as you don't want to change md-mod.ko)

Thanks,
NeilBrown


> 
> A resync occurs, but doesn't actually bring the disks into sync (doing
a
> check shows non-zero mismatch_cnt). Presumably this is because the
> bitmap on the less-recent disk has no idea of the changes made after
it
> was taken out of the mirror.
> 
> Probably not a common occurrence, but not unthinkable for this to
happen
> (in the case I'm dealing with, it was booting from a single disk from
a
> raid1 that had been set aside as a backup, then adding a more-recent
> disk to the mirror).
> 
> Is this something that should be fixed? I've been playing around with
> this, and have something that seems to work for me (keeping the change
> entirely in raid1... need to run a standard RHEL 5.5 kernel, so I
can't
> change the MD core). The basic idea is that I detect if the disk's
> superblock events counter is > the bitmap's events_cleared counter,
and
> refuse to add the disk to the raid1 in that case because it means that
> the disk being added had been assembled at some point after being
> removed from the raid1. Could instead trigger a full sync in this
case,
> though kicking the disk out allows data to be recovered from it.
> 
> There could be a similar problem if something like this happened for
an
> array without a bitmap, though in that case it would only mean loss of
> the changed data on the more-recent disk, not an incomplete sync. I
> couldn't think of an easy way around this one.
> 
> Anyway, following is a portion of the patch I'm using. Is something
like
> this generally applicable to MD? If not, do you see any problems with
my
> approach, even if it's not acceptible upstream?
> 
> --- raid1.c.orig	2010-07-22 13:31:14.000000000 -0400
> +++ raid1.c	2010-07-21 09:53:34.000000000 -0400
> 
> ... snip copies of unexported functions taken from md.c, my copies
start
> with raid1_ ...
> 
> @@ -1094,6 +1152,69 @@ static int raid1_add_disk(mddev_t *mddev
>  	int mirror = 0;
>  	mirror_info_t *p;
>  
> +	/* We need to protect against adding a "more-recent" device to
> an
> +	   array with a bitmap. Adding such a device will result in an
> +	   incomplete sync. This could happen if a device were removed
> +	   from an array, leaving the array running with only the more-
> +	   recent device. If the array were later restarted using only
> +	   the "less-recent" device, then we must prevent adding the
> more-
> +	   recent device back into the array (since the less-recent
> bitmap
> +	   wouldn't reflect changes made to the more-recent device).
> +
> +	   We depend on events_cleared to detect this condition, since
> it's
> +	   not incremented for a degraded array. Need to use sb->events_
> +	   cleared because bitmap->events_cleared isn't updated. An out-
> +	   of-date bitmap would cause events_cleared to change, and
> might
> +	   allow the more-recent device to be added to the array. This
> isn't
> +	   so bad, because this condition would also trigger a full
> sync. So,
> +	   data corruption would be avoided, but any more-recent data on
> the
> +	   more-recent device would be lost.
> +
> +	   We check saved_raid_disk >= 0 because we want to always allow
> +	   adds when a full sync will be done (saved_raid_disk < 0
> triggers
> +	   conf->fullsync = 1, below).
> +
> +	   This is not so much of a problem for an array without a
> bitmap,
> +	   since a full sync will always be done. However, it would
> still
> +	   be nice to prevent adding the more-recent device in this case
> +	   to avoid losing more-recent data on that device.
> +	*/
> +	if ((rdev->saved_raid_disk >= 0) && mddev->bitmap) {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&mddev->bitmap->lock, flags);
> +		if (!mddev->bitmap->sb_page) {
> +			spin_unlock_irqrestore(&mddev->bitmap->lock,
> flags);
> +		} else {
> +			bitmap_super_t *bm_sb;
> +			mdp_super_t *sb;
> +			__u64 events, bm_events_cleared;
> +
> +			spin_unlock_irqrestore(&mddev->bitmap->lock,
> flags);
> +
> +			bm_sb = (bitmap_super_t *)
> +				kmap_atomic(mddev->bitmap->sb_page,
> KM_USER0);
> +			bm_events_cleared =
> le64_to_cpu(bm_sb->events_cleared);
> +			kunmap_atomic(bm_sb, KM_USER0);
> +
> +			sb = (mdp_super_t *)page_address(rdev->sb_page);
> +			events = md_event(sb);
> +
> +			if ((events & ~1) > bm_events_cleared) {
> +				char b[BDEVNAME_SIZE];
> +
> +				printk("raid1: %s: kicking %s from array
> (bad"
> +				       " event count %llu > %llu
> cleared),"
> +				       " manual intervention
> required\n",
> +				       mdname(mddev),
> bdevname(rdev->bdev,b),
> +				       events, bm_events_cleared);
> +				raid1_unbind_rdev_from_array(rdev);
> +				raid1_export_rdev(rdev);
> +				return 0;
> +			}
> +		}
> +	}
> +
>  	for (mirror=0; mirror < mddev->raid_disks; mirror++)
>  		if ( !(p=conf->mirrors+mirror)->rdev) {
> 
> 
> 
> Thanks for any advice you can provide,
> 
> Nate Dailey
> Stratus Technologies
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid"
in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html