RE: [md PATCH 2/5] md: Enable reshape for external metadata

> -----Original Message-----
> From: Neil Brown [mailto:neilb@xxxxxxx]
> Sent: Thursday, June 17, 2010 12:35 PM
> To: Trela, Maciej
> Cc: Kwolek, Adam; linux-raid@xxxxxxxxxxxxxxx; Williams, Dan J;
> Ciechanowski, Ed
> Subject: Re: [md PATCH 2/5] md: Enable reshape for external metadata
> 
> On Thu, 17 Jun 2010 10:40:36 +0100
> "Trela, Maciej" <Maciej.Trela@xxxxxxxxx> wrote:
> 
> > >
> > > >
> > > > Another thing is waiting during reshape for a metadata update on
> > > > the MD_CHANGE_DEVS flag.
> > > > To roll the reshape I've added the following code (instead of
> > > > calling md_update_sb()):
> > >
> > > Yes, there is a real issue there...
> > >
> > > I don't think we ever need the kernel to wait for an external
> > > metadata handler
> > > to respond to device changes (apart from failure which is handled
> > > separately).
> > > So maybe the best thing is to guard all settings of MD_CHANGE_DEVS
> > > with if (mddev->persistent)
> > >
> > > I think that would be best, but I've made a note to review that
> > > later.
> > >
> >
> > Neil,
> > from what I see in raid5.c/md.c, the "native" code sets
> > MD_CHANGE_DEVS during the reshape when it reaches special points
> > where a metadata write is really needed to update the reshape
> > checkpoint.
> > In reshape_request():
> > 	/* Cannot proceed until we've updated the superblock */
> > 	..
> > 	set_bit(MD_CHANGE_DEVS, mddev->flags)
> >
> > In md_check_recovery() we have:
> > 	if (mddev->flags)
> > 		md_update_sb()
> >
> > Couldn't we follow this logic with MD_CHANGE_DEVS for external
> > metadata?
> > If not, how do we detect the need for a migration checkpoint update?
> 
> Good question.
> The first question to ask is
>   How does mdmon know when a metadata update is required, and how does
>   it tell md that the metadata update is complete.
> 
> OK, 2 first questions...
> 
> For the first I suspect it should watch 'md/reshape_position' (which
> we need to use sysfs_notify for).

Yes, I agree here.
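For what it's worth, the usual way for a userspace daemon to wait on a
sysfs_notify()-driven attribute is a POLLPRI poll loop: consume the
current contents, poll, then seek to 0 and re-read.  A minimal sketch
(helper names and error handling are mine, not existing mdmon code):

```c
#include <fcntl.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Read an attribute such as md/reshape_position as a number.
 * Illustrative helper, not actual mdmon code. */
static long long read_sysfs_ll(const char *path)
{
    char buf[64];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n <= 0)
        return -1;
    buf[n] = '\0';
    return strtoll(buf, NULL, 10);
}

/* Block until sysfs_notify() fires on an already-open attribute:
 * read once to arm the poll, wait for POLLPRI|POLLERR, then seek
 * back to 0 and re-read the new value. */
static long long wait_for_change(int fd)
{
    char buf[64];
    struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };

    lseek(fd, 0, SEEK_SET);
    (void)read(fd, buf, sizeof(buf) - 1);   /* arm the poll */
    poll(&pfd, 1, -1);                      /* sleeps until notified */
    lseek(fd, 0, SEEK_SET);
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n <= 0)
        return -1;
    buf[n] = '\0';
    return strtoll(buf, NULL, 10);
}
```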

> For the second .... I don't know.
> - Maybe sync_action could change to 'paused' and mdmon writes
> 'continue'....
>   but that is possibly overloading that file too much.
> - We could have a new sysfs file which just shows paused/active ??
> - We could require that mdmon sets 'sync_max' appropriately so that
>   reshape will stop at the right place, and then when mdmon has
>   updated the metadata, it sets a new sync_max value.
> - As above, but if sync_max is set too high, it is automatically
>   reduced to the place where raid5 finds that it has to stop.
> 
> I think the last one is probably best.
> Before updating ->reshape_position, raid5 checks ->resync_max and if
> it is too high for safety it sets it to a lower, safer value.
> Then it changes ->reshape_position and calls sysfs_notify.
> 
> mdmon watches for 'reshape_position' to change.  When it does, it
> updates the metadata and then writes a larger value to ->resync_max.
> 
> Things can get a little confusing when reshaping to fewer devices, as
> reshape_position decreases, but sync_completed always increases and
> sync_max is still an 'upper' limit.
> 
> But it should work OK.
> 
> Does that seem reasonable?
> 

Yes, it looks ok to me.
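If I follow the idea, the two sides would interact roughly like this toy
model (just restating the protocol; the struct and function names below
are made up, not real md/mdmon code):

```c
/* Toy model of the proposed handshake; all names are illustrative. */
struct mdstate {
    unsigned long long reshape_position;
    unsigned long long resync_max;
};

/* Kernel side: before advancing past the last point that is safe
 * without a metadata update, clamp resync_max so the reshape pauses
 * there, then record the checkpoint and notify mdmon. */
static void raid5_checkpoint(struct mdstate *md, unsigned long long safe)
{
    if (md->resync_max > safe)
        md->resync_max = safe;    /* md_do_sync will now wait here */
    md->reshape_position = safe;
    /* sysfs_notify(..., "reshape_position") would fire here */
}

/* mdmon side: once the external metadata records the checkpoint,
 * release the reshape by raising resync_max again. */
static void mdmon_release(struct mdstate *md, unsigned long long next_max)
{
    if (next_max > md->resync_max)
        md->resync_max = next_max;
}
```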
However, I have one doubt:
Let's assume we have a child_grow() reshape type:

1. child_grow() performs a backup of N stripes and sets
	resync_max = N * stripe_size

2. md starts reshaping.

3. For one of the first stripes = M (where M < N), md decides to
   update the metadata and notifies mdmon with the new
   reshape_position = M.  However, before that, it changes
   resync_max = M so the md_do_sync main loop will wait for a new
   resync_max value.

4. mdmon wakes on reshape_position, updates the metadata, and should
   then set a new value for resync_max to confirm the checkpoint.
   I think that resync_max should be set back to N at this moment, but
   mdmon is not able to do that: it does not know the original
   resync_max value (set in Grow.c).
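To put numbers on it (stripe_size, N and M below are illustrative
assumptions; the real values come from Grow.c):

```c
/* Walk the four steps above with made-up values to show what mdmon
 * can and cannot see.  Returns the resync_max visible after step 3. */
static unsigned long long clamped_resync_max(void)
{
    unsigned long long stripe_size = 512;            /* sectors, assumed */
    unsigned long long N = 1024;                     /* stripes backed up */
    unsigned long long resync_max = N * stripe_size; /* step 1 */

    unsigned long long M = 16;                       /* step 3 checkpoint */
    unsigned long long checkpoint = M * stripe_size;
    if (resync_max > checkpoint)
        resync_max = checkpoint;                     /* md_do_sync waits */

    /* step 4: mdmon would need N * stripe_size to resume, but N was
     * chosen inside Grow.c and never exported, so the only value it
     * can read back through sysfs is this clamped one. */
    return resync_max;
}
```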

What do you think?

Maybe using another sysfs file would be better?
I would vote for writing "continue" to "resync_start", or something
like that. :)


Maciek.

