Re: CRUSH commit and confirm mode

On 1/15/19 2:26 PM, Sage Weil wrote:
> On Tue, 15 Jan 2019, Wido den Hollander wrote:
>> On 1/14/19 10:31 PM, Sage Weil wrote:
>>> On Mon, 14 Jan 2019, Wido den Hollander wrote:
>>>> Hi,
>>>>
>>>> Having CRUSH updates happen on OSD start is very useful.
>>>> Tools like Ansible, Puppet and Salt can provision ceph.conf or other
>>>> scripts which can be run as hooks to inject OSDs at the right
>>>> location in the CRUSHMap.
>>>>
>>>> Something that is lacking (imho) is a 'commit and confirm' mode.
>>>>
>>>> Right now, after you deploy a new OSD with ceph-volume it is created
>>>> as a new OSD and also injected into the CRUSHMap. The topology
>>>> changes right away and backfills start.
>>>>
>>>> In certain scenarios it would be great if these OSDs were added to
>>>> the OSDMap while the CRUSH changes were staged in the MONs, but not
>>>> committed yet.
>>>>
>>>> This way the OSDs start to talk with the MONs and you can perform some
>>>> tests on them. The Mgr daemons start to collect data from them (although
>>>> they are idle).
>>>>
>>>> An idea would be that you can set the MONs in 'CRUSH commit mode' like:
>>>>
>>>> $ ceph osd crush commit
>>>>
>>>> The cluster now goes into WARN mode and all changes to the CRUSHMap are
>>>> staged, but not live yet.
>>>>
>>>> $ ceph osd crush diff
>>>>
>>>> This will show you the changes between the active CRUSHMap and the
>>>> changes which are staged.
>>>>
>>>> Once you have finished deploying your OSDs and testing everything, you can run:
>>>>
>>>> $ ceph osd crush confirm
>>>>
>>>> Or, if you think the changes should be discarded:
>>>>
>>>> $ ceph osd crush discard
>>>>
>>>> After you 'confirm' the changes, a new CRUSHMap is generated by the
>>>> Monitors and sent out to the cluster.
>>>>
>>>> This also prevents creating a new OSDMap every time an OSD is added.
>>>> Adding 200 OSDs would generate one CRUSH change instead of 200 small ones.
>>>>
>>>> Does this sound like a sane idea?
>>>
>>>
>>> Originally, the idea was to use
>>>
>>>  osd_crush_initial_weight = 0
>>>
>>> which would add the OSD in the correct position but leave its CRUSH 
>>> weight at 0.  That isn't ideal, though, since you have to go set the real 
>>> size/weight manually.
>>>
>>> If you're using the compat weight-set balancer mode, what would be better 
>>> is to instead set 
>>>
>>>  osd_crush_update_weight_set = false
>>>
>>> This means that the CRUSH weight would still be set to the size in TiB (as 
>>> usual) but the weight-set value would be 0, so it doesn't actually get any 
>>> data initially.  Then let the balancer work its magic in the background to 
>>> ramp the weight slowly and migrate data.
>>>
>>> I don't think this was ever tested, but this was how it was intended to 
>>> work.  I'd love to hear if it works in practice (or how badly it falls 
>>> over!).  If it does behave as intended, we might even consider making this 
>>> behavior the default...
>>>
>>
>> Would that work? Even though the OSDs are added with a zero (0)
>> weight in the CRUSHMap, the topology still changes.
>>
>> It could be a new host or an additional OSD, and that already changes
>> the CRUSHMap even if these new OSDs have a weight of 0.
> 
> A zero weight item won't affect placement.
>  

I see, I stand corrected there :)
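
For reference, as I read it the ceph.conf side of that suggestion would
look something like the snippet below. The [global] placement is my
assumption and I have not tested either option; both option names are
the ones mentioned above.

[global]
# compat weight-set balancer mode: keep the normal CRUSH weight (size
# in TiB) but start the weight-set value at 0
osd_crush_update_weight_set = false

# or, when not using the balancer at all: add new OSDs with weight 0
osd_crush_initial_weight = 0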

>> In the cases where I use the balancer it will be using upmap, as that
>> is far more effective than compat mode.
> 
> ...but you're right, it won't help in the upmap case.
> 
>> Hence my original idea: commit and confirm
>>
>> The MONs would keep a temporary OSDMap and apply all their changes to
>> that map. Once you 'confirm' the changes, they will be merged into the
>> active OSDMap and sent out to the cluster.
> 
> The problem with this is that it's hard/impossible for the mon to
> distinguish between two different streams of osdmap updates: those from
> the human who is going to interactively say "ok, looks good" and those
> from other cluster activity (PGs peering that need their up_thru value
> changed, other OSDs that happen to fail or (re)start during this
> period, and so on).

I see, although I was mainly pointing at CRUSHMap updates. Once you go
into 'commit' mode the CRUSHMap is blocked for changes and the MONs take
the current one and store it somewhere outside the OSDMap.

When you then confirm, the staged changes are compiled and the current
OSDMap is updated with the new CRUSHMap.
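
To make that concrete, the full staged workflow would look something
like this (all of these are the proposed commands from above; none of
them exist today):

$ ceph osd crush commit     # start staging, cluster goes to WARN
  ... deploy and test the new OSDs ...
$ ceph osd crush diff       # compare staged vs. active CRUSHMap
$ ceph osd crush confirm    # or: ceph osd crush discard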

> 
> If it's specifically OSD addition that we're worried about, we should 
> address that specifically.  Either,
> 
>  - OSDs aren't added into position at all except in a batch, 
> interactively, by an administrator, or
>  - OSDs are added with weight 0, and their weights are set to the non-zero 
> targets interactively, in a batch, by an administrator.
> 

It would be ideal if that could happen automatically. The OSDs are added
with a zero weight, but with a single command you could have all of them
update their weight to the correct value, for example:

$ ceph tell osd.* crush update weight

(Just as an example)
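
Until something like that exists, the closest thing today would be to
reweight each new OSD by hand once you are happy with it. The OSD ids
and the weight (device size in TiB) below are made up:

$ for id in 200 201 202; do ceph osd crush reweight osd.$id 7.3; done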

> In the crush-compat case, it feels like we cover this already with the 
> osd_crush_update_weight_set = false option (assuming we test and verify it 
> works as expected).  For upmap, we could do something similar, where the 
> new OSD is added but all PGs that would get mapped to it are upmap'ed 
> away initially (similar to prime_pg_temp).  And if the balancer is off 
> entirely, then 'osd_crush_initial_weight = 0' seems like the right thing.
> 
> Alternatively, the 'osd_crush_initial_weight = 0' (or a similar option)
> could be changed so that when the OSD is added it records its real
> weight somewhere else but leaves the CRUSH weight at 0, and a health
> alert comes up saying there are N new OSDs pending final inclusion, and
> a single button/command sets them, similar to your commit (and diff
> etc.) commands above.
> 

Yes, just like I mentioned above. Any OSD with a weight of 0 would then
update its weight to the correct value, which initiates peering of the
PGs and starts backfilling.

The OSDs would store something like 'device_weight' in their internal
datastore and calculate it on mkfs. If you then issue a command, all
OSDs with a CRUSH weight of 0 would update their weights in the
CRUSHMap.

Together with upmap you can make the addition of OSDs a very smooth process.
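
For completeness, getting the upmap balancer in place before such an
addition is roughly:

$ ceph osd set-require-min-compat-client luminous
$ ceph balancer mode upmap
$ ceph balancer on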

I'm currently in the middle of expanding a 10PB cluster with an
additional 4PB of capacity, and peering and/or CRUSH changes in such a
cluster should not be taken lightly. I try to avoid as much peering as
possible.

> Is it necessary to stage crush changes other than OSD additions?  I just 
> worry that there are cases where staging a change will break something.  
> Like the addition of a new crush rule to create a new pool.  Or the 
> balancer trying to create a compat-set.  Or something else...
> 

Maybe when removing nodes. But I think that can be handled more easily
by setting the 'norebalance' flag and then running a batch of OSD purge
commands and CRUSH commands to remove the nodes.
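
For example, something like this (the OSD ids and host name are made
up):

$ ceph osd set norebalance
$ for id in 120 121 122 123; do ceph osd purge $id --yes-i-really-mean-it; done
$ ceph osd crush remove ceph-host-12
$ ceph osd unset norebalance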

This will, however, still generate multiple CRUSH updates instead of one
update in which X OSDs / buckets disappear at once. The latter seems
cheaper in terms of CPU cycles and peering.

Wido

> sage
> 


