Re: CRUSH commit and confirm mode

On Tue, 15 Jan 2019, Wido den Hollander wrote:
> On 1/14/19 10:31 PM, Sage Weil wrote:
> > On Mon, 14 Jan 2019, Wido den Hollander wrote:
> >> Hi,
> >>
> >> Having CRUSH updates on OSD start is something which is very useful.
> >> Tools like Ansible, Puppet and Salt can provision ceph.conf or other
> >> scripts which can be run as hooks to inject OSDs at the right location
> >> in the CRUSHMap.
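
For concreteness, the ceph.conf such tools template out can either pin the
location statically or point at a hook script that prints it on stdout when
the OSD starts -- a minimal sketch; the root/rack/host names and the hook
path below are made up:

  [osd]
      # static location, templated per host by the config management tool
      crush location = root=default rack=rack1 host=node1
      # or: a hook run with --cluster/--id/--type that must print a
      # single-line CRUSH location to stdout
      crush location hook = /usr/local/bin/crush-location
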
> >>
> >> Something that is lacking (imho) is a 'commit and confirm' mode.
> >>
> >> Right now, after you deploy a new OSD with ceph-volume it's created as
> >> a new OSD and also injected into the CRUSHMap. Topology changes right
> >> away and backfills start.
> >>
> >> In certain scenarios it would be great if these OSDs were added to
> >> the OSDMap and the CRUSH changes staged in the MONs, but not
> >> committed yet.
> >>
> >> This way the OSDs start to talk with the MONs and you can perform some
> >> tests on them. The Mgr daemons start to collect data from them (although
> >> they are idle).
> >>
> >> An idea would be that you can set the MONs in 'CRUSH commit mode' like:
> >>
> >> $ ceph osd crush commit
> >>
> >> The cluster now goes into WARN mode and all changes to the CRUSHMap are
> >> staged, but not live yet.
> >>
> >> $ ceph osd crush diff
> >>
> >> This will show you the difference between the active CRUSHMap and the
> >> changes which are staged.
> >>
> >> Once you have finished deploying your OSDs and testing everything, you
> >> can run:
> >>
> >> $ ceph osd crush confirm
> >>
> >> Or, if you think the changes should be discarded:
> >>
> >> $ ceph osd crush discard
> >>
> >> After you 'confirm' the changes, a new CRUSHMap is generated by the
> >> Monitors and sent out to the cluster.
> >>
> >> This also prevents creating a new OSDMap every time an OSD is added.
> >> Adding 200 OSDs would generate one CRUSH change instead of 200 small ones.
> >>
> >> Does this sound like a sane idea?
> > 
> > 
> > Originally, the idea was to use
> > 
> >  osd_crush_initial_weight = 0
> > 
> > which would add the OSD in the correct position but leave its CRUSH 
> > weight at 0.  That isn't ideal, though, since you have to go set the real 
> > size/weight manually.
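
For reference, that manual flow looks roughly like this -- a sketch; the
device, OSD id and 3.64 TiB weight below are made up:

  # in ceph.conf on the OSD hosts (or 'ceph config set osd ...' on Mimic+)
  [osd]
      osd crush initial weight = 0

  $ ceph-volume lvm create --data /dev/sdb   # OSD joins in position with weight 0
  ... test the new OSD; it receives no data ...
  $ ceph osd crush reweight osd.12 3.64      # later, set the real weight by hand
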
> > 
> > If you're using the compat weight-set balancer mode, what would be better 
> > is to instead set 
> > 
> >  osd_crush_update_weight_set = false
> > 
> > This means that the CRUSH weight would still be set to the size in TiB (as 
> > usual) but the weight-set value would be 0, so it doesn't actually get any 
> > data initially.  Then let the balancer work its magic in the background to 
> > ramp the weight slowly and migrate data.
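
For anyone who wants to try it, the moving parts would be roughly as follows
-- a sketch, assuming the compat weight-set balancer is in use; the device
and OSD id are made up:

  $ ceph config set osd osd_crush_update_weight_set false   # or set it in ceph.conf
  $ ceph osd crush weight-set create-compat   # only if no compat weight-set exists yet
  $ ceph balancer mode crush-compat
  $ ceph balancer on
  $ ceph-volume lvm create --data /dev/sdb    # CRUSH weight = size in TiB, but the
                                              # weight-set value for the new OSD stays
                                              # at 0 until the balancer ramps it up
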
> > 
> > I don't think this was ever tested, but this was how it was intended to 
> > work.  I'd love to hear if it works in practice (or how badly it falls 
> > over!).  If it does behave as intended, we might even consider making this 
> > behavior the default...
> > 
> 
> Would that work? Even though the OSDs are added with a weight of zero (0)
> in the CRUSHMap, the topology still changes.
> 
> It could be a new host or an additional OSD, and that already changes the
> CRUSHMap even if these new OSDs have a weight of 0.

A zero weight item won't affect placement.
 
> In the cases where I use the balancer it uses upmap, as that is far more
> effective than compat mode.

...but you're right, it won't help in the upmap case.

> Hence my original idea: commit and confirm
> 
> The MONs would keep a temp OSDMap and commit all their changes to that
> map. Once you 'confirm' the changes they will be merged into the active
> OSDMap and sent out to the cluster.

The problem with this is that it's hard/impossible for the mon to 
distinguish between two different streams of osdmap updates: those from 
the human who is going to interactively say "ok, looks good" and those 
from other cluster activity (PGs peering that need their up_thru value 
changed, other OSDs that happen to fail or (re)start during this period, 
and so on).

If it's specifically OSD addition that we're worried about, we should 
address that specifically.  Either,

 - OSDs aren't added into position at all except in a batch, 
interactively, by an administrator, or
 - OSDs are added with weight 0, and their weights are set to the non-zero 
targets interactively, in a batch, by an administrator (a rough sketch of 
this second option follows below).
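
A rough sketch of the second option with what exists today (the OSD ids and
the 3.64 TiB target weight are made up):

  $ ceph config set osd osd_crush_initial_weight 0
  ... deploy and test the new OSDs; they map in at weight 0 and hold no data ...
  $ ceph osd crush reweight osd.10 3.64
  $ ceph osd crush reweight osd.11 3.64
  ...

(or, to make it a single map change, decompile the CRUSHMap with crushtool,
edit the weights, and inject it back with 'ceph osd setcrushmap -i').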

In the crush-compat case, it feels like we cover this already with the 
osd_crush_update_weight_set = false option (assuming we test and verify it 
works as expected).  For upmap, we could do something similar, where the 
new OSD is added but all PGs that would get mapped to it are upmap'ed 
away initially (similar to prime_pg_temp).  And if the balancer is off 
entirely, then 'osd_crush_initial_weight = 0' seems like the right thing.
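
There is no such built-in staging for upmap today, but the same effect can be
approximated by hand with the existing pg-upmap machinery (requires
min-compat-client luminous) -- a sketch; pg 1.7, the new osd.12 and the
previous osd.3 are made up:

  $ ceph osd pg-upmap-items 1.7 12 3   # keep pg 1.7 on osd.3 instead of the new osd.12
  ... repeat for each PG that now maps to the new OSD ...
  $ ceph osd rm-pg-upmap-items 1.7     # later, drop the exception and let the data move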

Alternatively, the 'osd_crush_initial_weight = 0' (or a similar option) 
could be changed so that when the OSD is added it records its real weight 
somewhere else but leaves the CRUSH weight at 0, and a health alert comes 
up saying there are N new OSDs pending final inclusion, and a 
single button/command sets them, similar to your commit (and diff etc) 
commands above.
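
As a strawman for that UI (entirely hypothetical -- none of these commands
exist today):

  $ ceph osd crush pending         # list the N OSDs with a recorded but unapplied weight
  $ ceph osd crush apply-pending   # apply them all in a single CRUSHMap change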

Is it necessary to stage crush changes other than OSD additions?  I just 
worry that there are cases where staging a change will break something.  
Like the addition of a new crush rule to create a new pool.  Or the 
balancer trying to create a compat-set.  Or something else...

sage


