Re: [PATCH] md/bitmap: don't read page from device with Bitmap_sync

Shaohua Li <shli@xxxxxxxxxx> · Wed, 21 Jun 2017 10:47:19 -0700

On Tue, Jun 20, 2017 at 01:59:45PM +1000, Neil Brown wrote:
> On Mon, Jun 19 2017, Shaohua Li wrote:
> 
> > On Mon, Jun 19, 2017 at 09:11:23AM +1000, Neil Brown wrote:
> >> On Fri, Jun 16 2017, Shaohua Li wrote:
> >> 
> >> > On Fri, Jun 16, 2017 at 05:19:27PM +0800, Guoqing Jiang wrote:
> >> >> The device owns Bitmap_sync flag needs recovery
> >> >> to become in sync, and read page from this type
> >> >> device could get stale status.
> >> >> 
> >> >> Signed-off-by: Guoqing Jiang <gqjiang@xxxxxxxx>
> >> >> ---
> >> >> When develop for clustered raid10 feature, if add a
> >> >> disk under grow mode in master node, I could get
> >> >> the "bitmap superblock UUID mismatch" warning due
> >> >> to the page is read from Bitmap_sync device.
> >> >> 
> >> >>  drivers/md/bitmap.c | 3 ++-
> >> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >> >> 
> >> >> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> >> >> index bf7419a..bf34cd8 100644
> >> >> --- a/drivers/md/bitmap.c
> >> >> +++ b/drivers/md/bitmap.c
> >> >> @@ -156,7 +156,8 @@ static int read_sb_page(struct mddev *mddev, loff_t offset,
> >> >>  
> >> >>  	rdev_for_each(rdev, mddev) {
> >> >>  		if (! test_bit(In_sync, &rdev->flags)
> >> >
> >> > Why In_sync is set for the rdev? I think it shoudn't.
> >> 
> >> There are several states a device can be in.
> >> If ->raid_disk is < 0, then the device is a spare, and doesn't contain
> >>     and valid data.
> >> Otherwise ->raid_disk >=0 and:
> >>    In_sync is clear, which means that blocks before ->recovery_offset
> >>        contain valid data, and blocks after there don't
> >>    In_sync is set, which means ->recovery_offset is irrelevant and
> >>        should be treated as MaxSector
> >>    Bitmap_sync is set, which could apply in either of the above cases,
> >>        and means blocks corresponding to bits that are set in the bitmap
> >>        may not be up to date.
> >> 
> >> Bitmap_sync is only relevant before a device has been handed to the
> >> personality.  After it has been added, ->recovery_cp ensures that the
> >> blocks that might be wrong are not read until until the bitmap-based
> >> recovery has fixed them up.
> >> 
> >> Bitmap_sync is only use to stop the device from being given to the
> >> personality if a recovery won't be started because the array is
> >> read-only.
> >> 
> >> So it is perfectly valid for both In_sync and Bitmap_sync to be set.
> >> If they are, it makes sense to avoid reading bitmap information from the
> >> Bitmap_sync device, as that will be out-of-date.
> >> 
> >> I'm not quite sure why Guoqing is getting a UUID mismatch ... it
> >> suggests that the new device wasn't initialized properly.  So there
> >> might be another bug.  But I think this is definitely a bug.
> >
> > Can you describe a scenario a disk has both bits set? I had hard time to
> > imagine it.
> 
> An array with a bitmap has a device fail (cable break?).
> So the bitmap stops clearing bits and ->events_cleared stays
> where it was, with the same value as events in the superblock
> of the failed device.
> 
> The device then starts working (cable is replaced) and
> it device is added back to the array.
> super_1_validate notices that the array has a bitmap and that the event
> count in the new device is >= mddev->bitmap->events_cleared, but <
> mddev->events.
> i.e. it is in the range for which a bitmap-based recovery is both
> possible and needed.
> So it sets Bitmap_sync.
> Then it notices that the 'role' in a valid role for the array,
> so that role is stored in rdev->saved_raid_disk and, as there is no
> MD_FEATURE_RECOVERY_OFFSET, In_sync is set.
> A similar sequence can happen in super_90_validate.

This part is quite confusing actually. The failed device is kicked out, so the
array is degraded. When we readd the device, we shouldn't set In_sync.

Actually we set In_sync in super_1_validate, we then clear it in add_bound_rdev
and reset it till recovery finishes. Set it till recovery finishes makes sense
to me, the super_1_validate is confusing.

> >
> > md_update_sb will update super and bitmap super at almost the same time. It's
> > possible we update super but not bitmap super and there is power loss. Is this
> > the case we have the both bits set? There seems no mechanism to prevent the
> > first disk has older bitmap super than the second disk in this scenario.
> 
> This isn't really about the superblock and bitmap on the one device
> being in sync or not.
> It is the superblock on one (failed) device having a different event
> count than the bitmap (on the other devices) that is actively being used
> by the array.

My point is we could use a disk's superblock, but its bitmap superblock is
stale (even is old)

Anyway, I still didn't understand when the problem could happen. Maybe Guoqing
can clarify.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html