Re: CRUSH commit and confirm mode

On 1/14/19 10:31 PM, Sage Weil wrote:
> On Mon, 14 Jan 2019, Wido den Hollander wrote:
>> Hi,
>>
>> Having CRUSH updates happen on OSD start is very useful. Tools like
>> Ansible, Puppet and Salt can provision ceph.conf or scripts that can
>> be run as hooks to place OSDs at the right location in the CRUSHMap.
>>
>> Something that is lacking (imho) is a 'commit and confirm' mode.
>>
>> Right now, when you deploy a new OSD with ceph-volume, it is created
>> and immediately injected into the CRUSHMap. The topology changes
>> right away and backfills start.
>>
>> In certain scenarios it would be great if these OSDs were added to
>> the OSDMap while the CRUSH changes are staged on the MONs, but not
>> committed yet.
>>
>> This way the OSDs start to talk with the MONs and you can perform some
>> tests on them. The Mgr daemons start to collect data from them (although
>> they are idle).
>>
>> An idea would be to put the MONs into a 'CRUSH commit mode', like:
>>
>> $ ceph osd crush commit
>>
>> The cluster now goes into WARN mode and all changes to the CRUSHMap are
>> staged, but not live yet.
>>
>> $ ceph osd crush diff
>>
>> This will show you the diff between the active CRUSHMap and the
>> staged changes.
>>
>> Once you have finished deploying your OSDs and testing everything,
>> you can run:
>>
>> $ ceph osd crush confirm
>>
>> Or, if you think the changes should be discarded:
>>
>> $ ceph osd crush discard
>>
>> After you 'confirm' the changes, a new CRUSHMap is generated by the
>> Monitors and sent out to the cluster.
>>
>> This also prevents creating a new OSDMap every time an OSD is added.
>> Adding 200 OSDs would generate one CRUSH change instead of 200 small ones.
>>
>> Does this sound like a sane idea?
> 
> 
> Originally, the idea was to use
> 
>  osd_crush_initial_weight = 0
> 
> which would add the OSD in the correct position but leave its CRUSH 
> weight at 0.  That isn't ideal, though, since you have to go set the real 
> size/weight manually.
> 

That indeed works, except when a host bucket is created. That still
causes data migration because a new host is added to the CRUSHMap.

The downside here is that you need to set the weight of the OSD manually.
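
For reference, the manual flow with that option would look roughly
like this (just a sketch; the OSD id and weight are placeholders):

  # ceph.conf on the new host, before the OSD is created
  [osd]
  osd crush initial weight = 0

  # once the OSD is up and you are happy with it, set the real
  # CRUSH weight (size in TiB) by hand:
  $ ceph osd crush reweight osd.200 3.64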

What I see in most cases is that people add OSDs by adding complete
hosts, and in those situations setting the weight to 0 does not help,
as backfills still start.

And when adding multiple nodes, backfills start and then get canceled
again due to the changes that happen after adding node 2, 3, 4, etc.

> If you're using the compat weight-set balancer mode, what would be better 
> is to instead set 
> 
>  osd_crush_update_weight_set = false
> 

I tried this on a Mimic 13.2.4 test cluster and it didn't work. The OSD
was still added with its expected weight and backfills started sending
data to the OSD.
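
For completeness, as far as I understand it this option only matters
when the compat weight-set and the balancer are in use, roughly like
this (a sketch, not necessarily the exact steps I used):

  # ceph.conf before deploying the new OSDs
  [osd]
  osd crush update weight set = false

  # one-time setup of the compat weight-set and balancer
  $ ceph osd crush weight-set create-compat
  $ ceph balancer mode crush-compat
  $ ceph balancer on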

> This means that the CRUSH weight would still be set to the size in TiB (as 
> usual) but the weight-set value would be 0, so it doesn't actually get any 
> data initially.  Then let the balancer work its magic in the background to 
> ramp the weight slowly and migrate data.

> I don't think this was ever tested, but this was how it was intended to 
> work.  I'd love to hear if it works in practice (or how badly it falls 
> over!).  If it does behave as intended, we might even consider making this 
> behavior the default...
> 
> sage
> 
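
To verify whether a freshly added OSD really comes in with a
weight-set value of 0 as intended, something like this should show it
(if I have the command names right; the OSD id is a placeholder):

  # compare the CRUSH weight with the compat weight-set value
  $ ceph osd crush weight-set dump

  # and, if needed, force the weight-set value to 0 by hand
  $ ceph osd crush weight-set reweight-compat osd.200 0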


