Re: cephadm autotuning ceph-osd memory

On 2/11/21 5:11 PM, Sage Weil wrote:
Hi everyone,

One of the things that ceph-ansible does that cephadm does not is automatically adjust the osd_memory_target based on the amount of memory and number of OSDs on the host.  The approach is pretty simple: if the not_hci flag is set, we take some factor (0.7) of total memory, divide by the OSD count, and set osd_memory_target accordingly.
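As a rough sketch, the calculation amounts to the following (function and parameter names here are illustrative, not ceph-ansible's actual variables):

```python
def osd_memory_target(total_mem_bytes: int, num_osds: int,
                      safety_factor: float = 0.7) -> int:
    """Reserve a fraction of host RAM for OSDs and split it evenly.

    Hypothetical sketch of the ceph-ansible-style calculation; the
    real playbook logic differs in detail.
    """
    if num_osds == 0:
        raise ValueError("no OSDs on this host")
    return int(total_mem_bytes * safety_factor) // num_osds

# e.g. a 64 GiB host with 8 OSDs gets roughly 5.6 GiB per OSD
target = osd_memory_target(64 * 2**30, 8)
```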

We'd like to bring this "hands off" behavior over to cephadm in some form, ideally one that is a bit more sophisticated.  The current proposal is written up in this pad:

https://pad.ceph.com/p/autotune_memory_target

The key ideas:
- An autotune_memory_target config option (boolean) can be enabled for a specific daemon, class, host, or the whole cluster - It will apply not just to OSDs but other daemons. Currently mon and osd have a memory_target config; MDS has a similar setting for its cache that we can work with.  The other daemons don't have configurable memory footprints (for the most part) but we can come up with some estimate of their *usage* based on some conservative estimates, maybe with cluster size factored in. - The autotuner would look at other (ceph) daemons on the host that aren't being tuned and deduct that memory from the total available, and then divvy up the remaining memory among the autotuned daemons.  Initially we'd only support autotuning osd, mon, and mds.

There are some open questions about how to implement some of the details, though.

For instance, the current proposal doesn't record what the "tuned" memory is in ceph anywhere, but rather would look at what 'cephadm ls' says about the deployed containers' limits, compare that to what it thinks should happen, and redeploy daemons as needed.

It might also make sense to allow these targets to be expressed via the 'service spec' yaml, which is more or less equivalent to the CRs in Rook (which currently do allow osd memory target to be set).

I'm not entirely sure if this should be an orchestrator function (i.e., work with both rook and cephadm) or cephadm-only.  It would be helpful to hear whether this sort of function would make any sense in a rook cluster... are there cases where the kubernetes nodes are dedicated to ceph and it makes sense to magically slurp up all of the hosts' RAM?  We probably don't want to be in the business of checking with kubernetes about unused host resources and adjusting things whenever the k8s scheduler does something on the host...

sage


One question is what to do when you want to add more daemons to a node after the targets have already been set.  An idea I've kicked around for bare-metal clusters is to let the daemons float their memory target between lower and upper bounds, based on how close all daemons on a particular node are, in aggregate, to a global memory target.  Each daemon would constantly re-evaluate its local target, like we already do when determining the aggregate cache budget based on that memory target.  One really neat thing you could do with this model is set the aggressiveness of each daemon based on the priorities and sizes of unfulfilled allocation requests in the priority cache manager.  The idea being that daemons that really aggressively want memory at high priorities can float their targets up, and daemons that don't care much can float their targets down.  Ultimately all daemons would still be governed by that global target, though (i.e., if we get close to or over it, even aggressive daemons would have to back off and reduce their targets).  I suspect this can be done on bare metal without strictly requiring coordination in the mgr (though we could certainly do it that way too).
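The per-daemon adjustment step could be sketched roughly like this. The function name, the fixed step size, and the wants_more signal are all hypothetical stand-ins for what the priority cache manager would actually feed in:

```python
def adjust_target(current, lower, upper, aggregate, global_budget,
                  wants_more, step=0.05):
    """Return a daemon's next local memory target.

    Floats the target up when the daemon has unfulfilled high-priority
    cache requests and the node has headroom, drifts it down otherwise,
    and forces everyone down when the node exceeds the global budget.
    """
    headroom = global_budget - aggregate
    if headroom <= 0:
        # Over the global budget: even aggressive daemons back off.
        nxt = int(current * (1 - step))
    elif wants_more:
        # Unfulfilled high-priority requests: float the target up.
        nxt = int(current * (1 + step))
    else:
        # Daemon doesn't need the memory: drift down slowly.
        nxt = int(current * (1 - step / 2))
    # Always stay within the daemon's configured bounds.
    return max(lower, min(upper, nxt))
```

Each daemon would call something like this on its own timer, so no central coordination is strictly required; convergence depends on all daemons seeing a reasonably fresh view of the node's aggregate usage.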

I agree for containers.  It's... different.  I don't know if we can even adjust container resource limits on the fly yet.  We could set the container limits really high (which is basically what we already do; the memory_target is already way lower than the container limit).  I don't know that we want to be coordinating memory usage behind the container's back like what I proposed above.  Ideally there would be some kind of rich API where something inside the container can negotiate more (or less) memory for the container as a whole.  I don't think anything like that exists, though.


Mark



_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx