Re: cephadm autotuning ceph-osd memory

On 2/16/21 4:29 AM, Lars Marowsky-Bree wrote:
> On 2021-02-12T09:00:41, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>
>> Containers don't have to have a memory cgroup limit. It may be helpful
>> to avoid that with cephadm, perhaps using a strict memory limit in
>> testing but not in production to avoid potential availability problems.
>
> Instead of handling this externally (and fairly statically?) via
> cephadm, how about the Ceph daemons on a node communicate to each other
> about memory availability/allocation/requirements dynamically?
>
> Then the "total" memory limit of Ceph on a node would be "per-pod". Ceph
> would manage the resources allocated to it itself.
>
> Say, going back to the point raised earlier, if an additional OSD is
> started, all others would reduce their cache targets dynamically to make
> room for that one (assuming there's enough space, otherwise the new
> daemon wouldn't fully spin up and exit/pause).
>
> The overall limits can even be enforced via the OS without k8s pods in
> cgroups if so chosen.


This is more or less the model I've had stewing on the back burner since KubeCon in Barcelona. I'm not even sure we need very much communication to make it happen. Each daemon would know a global (node/pod/whatever) target, already shared via ceph.conf or the pod spec or whatever.

For individual memory usage calculations we can't really use RSS (we tried; it can lead to really bad thrashing because the kernel isn't guaranteed to release memory when we free it). For the more global decisions we might be able to look at neighbor process info, though it's probably safer to keep using the current tcmalloc stats, which we would have to share -- but we already collect them in the perf counters for daemons using the priority cache manager, and we could update them fairly slowly for this purpose. Ultimately each daemon would see not only its local usage but the combined usage of all daemons in the same group.

Each daemon would then independently calculate how aggressive it wants to be, given the fraction of the total it's already using and how close all daemons are, in aggregate, to the global limit. As the aggregate usage gets closer to the aggregate target, each daemon independently backs off at varying rates. The emergent behavior should be an aggregate back-off, with the individual rates driven by local cache priorities combined with how close we are to the aggregate limit. As we approach the limit, some daemons would need to start freeing memory early to make up for daemons that are being more aggressive. Ultimately *all* daemons would need to back off when near (or above!) the target, even if some do so earlier than others.

As local priorities change we'd see small changes in allocations per daemon; to avoid constant small adjustments as relative aggressiveness drifts over time, we could use the same "chunky" allocation scheme we already use locally.

None of this guarantees we always stay below the aggregate target (that's much harder given how our daemons work), but we could stay under it with very high probability and, hopefully, with only minor spikes relative to the aggregate total.
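To make the back-off concrete, here is a minimal sketch (not Ceph code; the chunk size, the quadratic pressure curve, and the priority weighting are assumptions made up purely for illustration) of how one daemon could pick its next cache target from its own tcmalloc-reported usage, the shared aggregate usage of its group, and the global target:

// Sketch only: each daemon sees its own tcmalloc-reported usage plus the
// shared aggregate usage of its group, and independently shrinks its cache
// target as the group approaches the global budget.
#include <algorithm>
#include <cstdint>
#include <iostream>

// Round down to a "chunky" allocation boundary so the target doesn't
// wiggle constantly as relative aggressiveness drifts.
static uint64_t round_to_chunk(uint64_t bytes, uint64_t chunk) {
  return (bytes / chunk) * chunk;
}

// Compute this daemon's next cache target.
//  global_target : memory budget for the whole group (node/pod)
//  aggregate_used: sum of tcmalloc mapped bytes across all daemons
//  local_used    : this daemon's tcmalloc mapped bytes
//  priority      : 0..1, how aggressively this daemon wants to cache
static uint64_t next_cache_target(uint64_t global_target,
                                  uint64_t aggregate_used,
                                  uint64_t local_used,
                                  double priority,
                                  uint64_t chunk = 128ull << 20) {
  // Fraction of the budget the group is already using, capped at 1.
  double pressure = std::min(1.0, double(aggregate_used) / global_target);
  // Back off harder as pressure rises; low-priority daemons back off
  // earlier and harder than high-priority ones (the quadratic curve is
  // an arbitrary choice for this sketch).
  double keep = (1.0 - pressure * pressure) * priority +
                (1.0 - pressure) * (1.0 - priority);
  // Proportional share of the budget, scaled by how much this daemon keeps.
  double share = double(local_used) / std::max<uint64_t>(aggregate_used, 1);
  uint64_t target = uint64_t(global_target * share * keep);
  return round_to_chunk(std::max(target, chunk), chunk);
}

int main() {
  // Example: a 24 GiB pod budget with aggregate usage at 75% of it.
  uint64_t gib = 1ull << 30;
  std::cout << next_cache_target(24 * gib, 18 * gib, 6 * gib, 0.8) / gib
            << " GiB\n";
}

With, say, a 24 GiB pod budget, 18 GiB of aggregate usage, and 6 GiB of local usage, that yields a target of roughly 3 GiB for the daemon; in practice the inputs would come from the shared perf counter stats rather than hard-coded arguments.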


Mark





> Regards,
>      Lars




