On 2/16/21 4:29 AM, Lars Marowsky-Bree wrote:
On 2021-02-12T09:00:41, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
Containers don't have to have a memory cgroup limit. It may be helpful
to avoid that with cephadm, perhaps using a strict memory limit in
testing but not in production to avoid potential availability problems.
Instead of handling this externally (and fairly statically?) via
cephadm, how about having the Ceph daemons on a node communicate with
each other about memory availability/allocation/requirements
dynamically? Then the "total" memory limit of Ceph on a node would be
"per-pod", and Ceph would itself manage the resources allocated to it.
Say, going back to the point raised earlier, if an additional OSD is
started, all the others would reduce their cache targets dynamically to
make room for that one (assuming there's enough space; otherwise the
new daemon wouldn't fully spin up and would exit or pause instead).
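Roughly what I have in mind, as a minimal sketch (all of these names
are hypothetical, just to illustrate the redistribution; nothing here
is existing Ceph code):

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>

    // Hypothetical illustration: split a node-wide memory budget across
    // the daemons currently running on the node.  When a new daemon is
    // admitted, everyone else's cache target shrinks; if the minimums
    // no longer fit, the new daemon is refused (it would pause/exit
    // instead of fully spinning up).
    struct DaemonReq {
      uint64_t min_bytes;     // floor the daemon can't operate below
    };

    bool redistribute(uint64_t node_budget,
                      std::map<std::string, DaemonReq>& daemons,
                      std::map<std::string, uint64_t>& targets) {
      uint64_t min_total = 0;
      for (auto& [name, req] : daemons)
        min_total += req.min_bytes;
      if (min_total > node_budget)
        return false;  // not enough space for the newcomer
      // Hand every daemon its minimum plus an equal share of the rest.
      uint64_t spare = node_budget - min_total;
      uint64_t share = daemons.empty() ? 0 : spare / daemons.size();
      for (auto& [name, req] : daemons)
        targets[name] = req.min_bytes + share;
      return true;
    }

    int main() {
      std::map<std::string, DaemonReq> daemons = {
        {"osd.0", {1u << 30}}, {"osd.1", {1u << 30}}, {"osd.2", {1u << 30}}};
      std::map<std::string, uint64_t> targets;
      redistribute(16ull << 30, daemons, targets);   // 16 GiB node budget
      daemons["osd.3"] = {1u << 30};                 // a new OSD starts
      redistribute(16ull << 30, daemons, targets);   // everyone shrinks
      for (auto& [name, t] : targets)
        std::cout << name << " -> " << (t >> 20) << " MiB\n";
    }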
The overall limits could even be enforced by the OS via cgroups,
without k8s pods, if so chosen.
This is more or less the model I've had stewing on the back burner in
my brain since KubeCon in Barcelona. I'm not even sure we need very
much communication to make it happen. Each daemon would know a global
(node/pod/whatever) target, already shared via ceph.conf or the pod
spec or whatever.

For individual memory usage calculations we can't really use RSS (we
tried; it can lead to really bad thrashing because the kernel isn't
guaranteed to release memory when we free it). For more global
decisions we might be able to look at neighbor process info, though
it's probably safer to keep using the current tcmalloc stats, which we
would have to share (we already collect these in the perf counters for
daemons using the priority cache manager, and we could update them at a
fairly slow rate for this purpose).
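For reference, these are the kind of numbers I mean: what tcmalloc
itself reports rather than what the kernel thinks our RSS is. A
standalone illustration using the gperftools MallocExtension interface
(the exact properties Ceph exports as perf counters may differ):

    #include <cstddef>
    #include <iostream>
    #include <gperftools/malloc_extension.h>

    // Illustration only: the tcmalloc numbers we'd share instead of
    // RSS.  The priority cache already exports similar values as perf
    // counters; this just shows where they come from.
    int main() {
      size_t allocated = 0, heap = 0, unmapped = 0;
      MallocExtension* m = MallocExtension::instance();
      m->GetNumericProperty("generic.current_allocated_bytes", &allocated);
      m->GetNumericProperty("generic.heap_size", &heap);
      m->GetNumericProperty("tcmalloc.pageheap_unmapped_bytes", &unmapped);
      // Mapped heap is what actually counts against the node's memory;
      // RSS can stay high even after we free, so we don't trust it.
      std::cout << "allocated: " << allocated
                << " mapped heap: " << (heap - unmapped) << "\n";
      // build with: g++ -O2 example.cc -ltcmalloc
    }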
Ultimately each daemon would see not only its local usage but the
combined usage of all daemons in the same group. Each daemon would
independently calculate how aggressive it wants to be given the
fraction of the total it's already using and how close to the global
limit all daemons are in aggregate. As the aggregate usage gets closer
to the aggregate total, each daemon independently backs off at varying
rates.
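Purely as a sketch of the math (the names and the specific curve are
made up; the real thresholds and priorities would have to come from the
priority cache code):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>

    // Sketch of the per-daemon decision (hypothetical names/curve).
    // Each daemon only needs its own usage, the group's combined usage,
    // and the shared group target; no negotiation protocol required.
    uint64_t new_cache_target(uint64_t local_usage,
                              uint64_t group_usage,
                              uint64_t group_target,
                              double local_priority)   // 0.0 .. 1.0
    {
      // How much of the group's memory this daemon is responsible for.
      double my_share = group_usage ? (double)local_usage / group_usage : 0.0;
      // How close the whole group is to the target (can exceed 1.0).
      double pressure = (double)group_usage / group_target;
      // Back off harder as pressure approaches/exceeds 1.0; high-priority
      // caches back off later, low-priority ones earlier.
      double backoff = std::clamp(pressure - (0.5 + 0.4 * local_priority),
                                  0.0, 1.0);
      // Shrink our share of the target by that factor.  Everyone doing
      // this independently should pull the aggregate back under target.
      double ideal = my_share * group_target * (1.0 - backoff);
      return (uint64_t)std::max(ideal, 128.0 * 1024 * 1024);  // floor
    }

    int main() {
      // Example: group is at ~94% of a 16 GiB target; we hold 6 GiB of it.
      uint64_t t = new_cache_target(6ull << 30, 15ull << 30, 16ull << 30, 0.5);
      std::cout << (t >> 20) << " MiB\n";
    }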
The emergent behavior should be an aggregate back-off, but with the
individual rates based on local cache priorities combined with how
close to the aggregate limit we are. As we get closer to the limit,
some daemons would need to start freeing memory early to make up for
daemons that are being more aggressive. Ultimately *all* daemons would
need to back off when near (or above!) the target, even if some do so
earlier than others.

As local priorities change we'd otherwise see constant small changes in
allocations per daemon. To avoid that we could use the same "chunky"
allocation scheme we use locally, so that small shifts in relative
aggressiveness over time don't turn into constant small allocation
changes.
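The chunking could be as simple as rounding the computed target to a
coarse granularity and only acting when it crosses a chunk boundary;
something like this (hypothetical, just roughly mirroring what the
local priority cache already does):

    #include <cstdint>
    #include <iostream>

    // Sketch of "chunky" target updates: round the ideal target to a
    // coarse chunk and only move when we land in a different chunk, so
    // small wiggles in relative aggressiveness don't cause constant
    // cache resizing.
    uint64_t chunk_align(uint64_t ideal, uint64_t chunk = 128ull << 20) {
      return (ideal / chunk) * chunk;
    }

    bool maybe_update(uint64_t& current, uint64_t ideal,
                      uint64_t chunk = 128ull << 20) {
      uint64_t wanted = chunk_align(ideal, chunk);
      if (wanted == current)
        return false;           // within the same chunk: do nothing
      current = wanted;
      return true;              // crossed a chunk boundary: resize caches
    }

    int main() {
      uint64_t target = 4096ull << 20;                 // 4 GiB
      std::cout << maybe_update(target, 4100ull << 20) // tiny wiggle: no-op
                << " " << (target >> 20) << " MiB\n";
      std::cout << maybe_update(target, 3500ull << 20) // real shift: update
                << " " << (target >> 20) << " MiB\n";
    }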
None of this guarantees we always stay below the aggregate target
(that's much harder given how our daemons work), but we could stay
under it with very high probability and hopefully with only minor
spikes relative to the aggregate total.
Mark
Regards,
Lars