Hi everyone,
One of the things that ceph-ansible does that cephadm does not is automatically adjust osd_memory_target based on the amount of memory and the number of OSDs on the host. The approach is pretty simple: if the not_hci flag is set, we take some factor (0.7) of total memory, divide by the OSD count, and set osd_memory_target accordingly.
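To make the heuristic concrete, here is a minimal sketch of that calculation (the function name and signature are mine, not ceph-ansible's; only the 0.7 factor and the divide-by-OSD-count logic come from the description above):

```python
def osd_memory_target(total_mem_bytes, osd_count, factor=0.7):
    """Hypothetical sketch of the ceph-ansible heuristic: reserve a
    fraction of host memory for OSDs and split it evenly among them."""
    if osd_count == 0:
        return None
    return int(total_mem_bytes * factor / osd_count)
```

So a 64 GiB host with 8 OSDs would get a per-OSD target of roughly 0.7 * 64 GiB / 8, or about 5.6 GiB each.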
We'd like to bring this "hands off" behavior over to cephadm in some form... ideally one that is a bit more sophisticated. The current proposal is written up in this pad:
The key ideas:
- An autotune_memory_target config option (boolean) can be enabled for a specific daemon, class, host, or the whole cluster
- It will apply not just to OSDs but to other daemons as well. Currently mon and osd have a memory_target config; the MDS has a similar setting for its cache that we can work with. The other daemons don't have configurable memory footprints (for the most part), but we can come up with conservative estimates of their *usage*, maybe with cluster size factored in.
- The autotuner would look at other (ceph) daemons on the host that aren't being tuned and deduct that memory from the total available, and then divvy up the remaining memory among the autotuned daemons. Initially we'd only support autotuning osd, mon, and mds.
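Roughly, the divvying step described above might look like the following sketch. This is not cephadm's actual implementation; the untuned-daemon estimates, the default overhead, and all names here are assumptions for illustration:

```python
AUTOTUNED = {'osd', 'mon', 'mds'}  # per the proposal, only these initially

# Hypothetical conservative per-daemon usage estimates (bytes) for
# daemons that have no tunable memory target.
UNTUNED_ESTIMATE = {
    'mgr': 2 * 2**30,
    'rgw': 1 * 2**30,
    'crash': 128 * 2**20,
}

def autotune_targets(total_mem, daemons):
    """daemons: list of daemon types on the host, e.g. ['osd', 'osd', 'mon', 'mgr'].
    Deduct estimated usage of untuned daemons, then split the remainder
    evenly among the autotuned ones. Returns bytes per tuned daemon."""
    tuned = [d for d in daemons if d in AUTOTUNED]
    overhead = sum(UNTUNED_ESTIMATE.get(d, 512 * 2**20)
                   for d in daemons if d not in AUTOTUNED)
    available = total_mem - overhead
    if not tuned or available <= 0:
        return None
    return available // len(tuned)
```

An even split is the simplest policy; a more sophisticated version could weight daemon types differently (e.g. give mons a smaller share than OSDs).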
There are some open questions about how to implement some of the details, though.
For instance, the current proposal doesn't record the "tuned" memory anywhere in ceph; instead, it would look at what 'cephadm ls' reports about the deployed containers' limits, compare that to what it thinks should happen, and redeploy daemons as needed.
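The reconcile loop implied here could be sketched like this. The 'cephadm ls' field names below ('name', 'memory_limit') are assumptions for illustration, not the tool's actual output schema:

```python
def daemons_to_redeploy(ls_output, targets):
    """ls_output: list of dicts as 'cephadm ls' might report deployed
    containers (field names assumed); targets: daemon name -> computed
    memory target in bytes. Returns daemons whose deployed limit no
    longer matches the computed target."""
    stale = []
    for d in ls_output:
        want = targets.get(d['name'])
        if want is not None and d.get('memory_limit') != want:
            stale.append(d['name'])
    return stale
```

The upside of this approach is that there is no extra state to keep in sync; the downside is that every tuning decision has to be re-derived from the deployed containers on each pass.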
It might also make sense to allow these targets to be expressed via the 'service spec' yaml, which is more or less equivalent to the CRs in Rook (which currently do allow osd memory target to be set).
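For example, a service spec might express this with something like the following. This is a hypothetical sketch only: service_type, service_id, and placement follow the existing spec format, but the memory fields are illustrative and not an existing schema:

```yaml
service_type: osd
service_id: default_drive_group
placement:
  host_pattern: '*'
spec:
  # hypothetical knobs, mirroring what Rook's CRs allow today
  osd_memory_target: 4294967296   # explicit per-OSD target (bytes)
  # or, alternatively:
  autotune_memory_target: true
```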
I'm not entirely sure if this should be an orchestrator function (i.e., work with both rook and cephadm) or cephadm-only. It would be helpful to hear whether this sort of function would make any sense in a rook cluster... are there cases where the kubernetes nodes are dedicated to ceph and it makes sense to magically slurp up all of the hosts' RAM? We probably don't want to be in the business of checking with kubernetes about unused host resources and adjusting things whenever the k8s scheduler does something on the host...
sage