Re: CRUSH commit and confirm mode

On 1/14/19 10:31 PM, Sage Weil wrote:
> On Mon, 14 Jan 2019, Wido den Hollander wrote:
>> Hi,
>>
>> Having CRUSH updates run on OSD start is something which is very useful.
>> Tools like Ansible, Puppet and Salt can provision ceph.conf or other
>> scripts which can be run as hooks to inject OSDs at the right location
>> in the CRUSHMap.
>>
>> Something that is lacking (imho) is a 'commit and confirm' mode.
>>
>> Right now, after you deploy a new OSD with ceph-volume it's created as
>> a new OSD and also injected into the CRUSHMap. The topology changes
>> right away and backfills start.
>>
>> In certain scenarios it would be great if these OSDs were added to
>> the OSDMap while the CRUSH changes are staged in the MONs, but not
>> committed yet.
>>
>> This way the OSDs start to talk with the MONs and you can perform some
>> tests on them. The Mgr daemons start to collect data from them (although
>> they are idle).
>>
>> An idea would be that you can set the MONs in 'CRUSH commit mode' like:
>>
>> $ ceph osd crush commit
>>
>> The cluster now goes into WARN mode and all changes to the CRUSHMap are
>> staged, but not live yet.
>>
>> $ ceph osd crush diff
>>
>> This will show you the diff between the active CRUSHMap and the
>> staged changes.
>>
>> Once you have finished deploying your OSDs and testing everything,
>> you can run:
>>
>> $ ceph osd crush confirm
>>
>> Or, if you think the changes should be discarded:
>>
>> $ ceph osd crush discard
>>
>> After you 'confirm' the changes a new CRUSHMap is generated by the
>> Monitors and sent out to the cluster.
>>
>> This also prevents creating a new OSDMap every time an OSD is added.
>> Adding 200 OSDs would generate one CRUSH change instead of 200 small ones.
>>
>> Does this sound like a sane idea?
> 
> 
> Originally, the idea was to use
> 
>  osd_crush_initial_weight = 0
> 
> which would add the OSD in the correct position but leave its CRUSH 
> weight at 0.  That isn't ideal, though, since you have to go set the real 
> size/weight manually.
> 
> If you're using the compat weight-set balancer mode, what would be better 
> is to instead set 
> 
>  osd_crush_update_weight_set = false
> 
> This means that the CRUSH weight would still be set to the size in TiB (as 
> usual) but the weight-set value would be 0, so it doesn't actually get any 
> data initially.  Then let the balancer work its magic in the background to 
> ramp the weight slowly and migrate data.
> 
> I don't think this was ever tested, but this was how it was intended to 
> work.  I'd love to hear if it works in practice (or how badly it falls 
> over!).  If it does behave as intended, we might even consider making this 
> behavior the default...
> 
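The two approaches Sage describes could be tried with config fragments
like the following. The option names come from his message; the
`ceph config set` invocation assumes Mimic or later (older releases would
use ceph.conf), and `osd.123` / `3.64` are example placeholders:

```shell
# 1) Add new OSDs at CRUSH weight 0, then set the real weight manually:
ceph config set osd osd_crush_initial_weight 0
ceph osd crush reweight osd.123 3.64   # example OSD id and TiB weight

# 2) Keep the CRUSH weight at the device size, but leave the compat
#    weight-set value at 0 and let the balancer ramp it up over time:
ceph config set osd osd_crush_update_weight_set false
ceph balancer mode crush-compat
ceph balancer on
```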

Would that work? Because even though the OSDs are added with a zero (0)
weight in the CRUSHMap, the topology still changes.

It could be a new host or an additional OSD, and either already changes
the CRUSHMap even if these new OSDs have a weight of 0.

In the cases where I use the balancer it will be using upmap, as that
is far more effective than the compat mode.
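For reference, switching the balancer to upmap mode looks like this
(pg-upmap requires all clients to speak Luminous or newer):

```shell
# upmap needs luminous-or-later clients before it can be enabled
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
```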

Hence my original idea: commit and confirm

The MONs would keep a temporary OSDMap and commit all their changes to
that map. Once you 'confirm' the changes, they will be merged into the
active OSDMap and sent out to the cluster.
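Put together, the proposed workflow would look something like this. Note
that these `ceph osd crush` subcommands are the proposal itself and do
not exist in Ceph today:

```shell
ceph osd crush commit     # enter staging mode; cluster goes HEALTH_WARN
# ... deploy OSDs with ceph-volume; their CRUSH changes are staged ...
ceph osd crush diff       # compare the active CRUSHMap with the staged one
ceph osd crush confirm    # merge staged changes, publish one new map
# or, to throw the staged changes away:
ceph osd crush discard
```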

Wido

> sage
> 


