On 2/16/21 4:29 AM, Lars Marowsky-Bree wrote:
On 2021-02-12T09:00:41, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
Containers don't have to have a memory cgroup limit. It may be helpful
to avoid that with cephadm, perhaps using a strict memory limit in
testing but not in production to avoid potential availability problems.
Instead of handling this externally (and fairly statically?) via
cephadm, how about having the Ceph daemons on a node communicate with
each other about memory availability/allocation/requirements
dynamically? Then the "total" memory limit of Ceph on a node would be
"per-pod", and Ceph would itself manage the resources allocated to it.
Say, going back to the point raised earlier, if an additional OSD is
started, all the others would reduce their cache targets dynamically to
make room for that one (assuming there's enough space; otherwise the
new daemon wouldn't fully spin up and would exit or pause instead).
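Roughly what I have in mind, as a minimal sketch (all of these names
are hypothetical, just to illustrate the redistribution; nothing here
is existing Ceph code):

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>

    // Hypothetical illustration: split a node-wide memory budget across
    // the daemons currently running on the node.  When a new daemon is
    // admitted, everyone else's cache target shrinks; if the minimums
    // no longer fit, the new daemon is refused (it would pause/exit
    // instead of fully spinning up).
    struct DaemonReq {
      uint64_t min_bytes;     // floor the daemon can't operate below
    };

    bool redistribute(uint64_t node_budget,
                      std::map<std::string, DaemonReq>& daemons,
                      std::map<std::string, uint64_t>& targets) {
      uint64_t min_total = 0;
      for (auto& [name, req] : daemons)
        min_total += req.min_bytes;
      if (min_total > node_budget)
        return false;  // not enough space for the newcomer
      // Hand every daemon its minimum plus an equal share of the rest.
      uint64_t spare = node_budget - min_total;
      uint64_t share = daemons.empty() ? 0 : spare / daemons.size();
      for (auto& [name, req] : daemons)
        targets[name] = req.min_bytes + share;
      return true;
    }

    int main() {
      std::map<std::string, DaemonReq> daemons = {
        {"osd.0", {1u << 30}}, {"osd.1", {1u << 30}}, {"osd.2", {1u << 30}}};
      std::map<std::string, uint64_t> targets;
      redistribute(16ull << 30, daemons, targets);   // 16 GiB node budget
      daemons["osd.3"] = {1u << 30};                 // a new OSD starts
      redistribute(16ull << 30, daemons, targets);   // everyone shrinks
      for (auto& [name, t] : targets)
        std::cout << name << " -> " << (t >> 20) << " MiB\n";
    }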
The overall limits could even be enforced by the OS via cgroups,
without k8s pods, if so chosen.
This is more or less the model I've had stewing on the back burner in
my brain since KubeCon in Barcelona. I'm not even sure we need very
much communication to make it happen. Each daemon would know a global
(node/pod/whatever) target, already shared via ceph.conf or the pod
spec or whatever.

For individual memory usage calculations we can't really use RSS (we
tried; it can lead to really bad thrashing because the kernel isn't
guaranteed to release memory when we free it). For more global
decisions we might be able to look at neighbor process info, though
it's probably safer to keep using the current tcmalloc stats, which we
would have to share (we already collect these in the perf counters for
daemons using the priority cache manager, and we could update them at a
fairly slow rate for this purpose).
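For reference, these are the kind of numbers I mean: what tcmalloc
itself reports rather than what the kernel thinks our RSS is. A
standalone illustration using the gperftools MallocExtension interface
(the exact properties Ceph exports as perf counters may differ):

    #include <cstddef>
    #include <iostream>
    #include <gperftools/malloc_extension.h>

    // Illustration only: the tcmalloc numbers we'd share instead of
    // RSS.  The priority cache already exports similar values as perf
    // counters; this just shows where they come from.
    int main() {
      size_t allocated = 0, heap = 0, unmapped = 0;
      MallocExtension* m = MallocExtension::instance();
      m->GetNumericProperty("generic.current_allocated_bytes", &allocated);
      m->GetNumericProperty("generic.heap_size", &heap);
      m->GetNumericProperty("tcmalloc.pageheap_unmapped_bytes", &unmapped);
      // Mapped heap is what actually counts against the node's memory;
      // RSS can stay high even after we free, so we don't trust it.
      std::cout << "allocated: " << allocated
                << " mapped heap: " << (heap - unmapped) << "\n";
      // build with: g++ -O2 example.cc -ltcmalloc
    }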
Ultimately each daemon would see not only its local usage but the
combined usage of all daemons in the same group. Each daemon would
independently calculate how aggressive it wants to be given the
fraction of the total it's already using and how close to the global
limit all daemons are in aggregate. As the aggregate usage gets closer
to the aggregate total, each daemon independently backs off at varying
rates.
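Purely as a sketch of the math (the names and the specific curve are
made up; the real thresholds and priorities would have to come from the
priority cache code):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>

    // Sketch of the per-daemon decision (hypothetical names/curve).
    // Each daemon only needs its own usage, the group's combined usage,
    // and the shared group target; no negotiation protocol required.
    uint64_t new_cache_target(uint64_t local_usage,
                              uint64_t group_usage,
                              uint64_t group_target,
                              double local_priority)   // 0.0 .. 1.0
    {
      // How much of the group's memory this daemon is responsible for.
      double my_share = group_usage ? (double)local_usage / group_usage : 0.0;
      // How close the whole group is to the target (can exceed 1.0).
      double pressure = (double)group_usage / group_target;
      // Back off harder as pressure approaches/exceeds 1.0; high-priority
      // caches back off later, low-priority ones earlier.
      double backoff = std::clamp(pressure - (0.5 + 0.4 * local_priority),
                                  0.0, 1.0);
      // Shrink our share of the target by that factor.  Everyone doing
      // this independently should pull the aggregate back under target.
      double ideal = my_share * group_target * (1.0 - backoff);
      return (uint64_t)std::max(ideal, 128.0 * 1024 * 1024);  // floor
    }

    int main() {
      // Example: group is at ~94% of a 16 GiB target; we hold 6 GiB of it.
      uint64_t t = new_cache_target(6ull << 30, 15ull << 30, 16ull << 30, 0.5);
      std::cout << (t >> 20) << " MiB\n";
    }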
The emergent behavior should be an aggregate back-off, but with the
individual rates based on local cache priorities combined with how
close to the aggregate limit we are. As we get closer to the limit,
some daemons would need to start freeing memory early to make up for
daemons that are being more aggressive. Ultimately *all* daemons would
need to back off when near (or above!) the target, even if some do so
earlier than others.

As local priorities change we'd otherwise see constant small changes in
allocations per daemon. To avoid that we could use the same "chunky"
allocation scheme we use locally, so that small shifts in relative
aggressiveness over time don't turn into constant small allocation
changes.
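The chunking could be as simple as rounding the computed target to a
coarse granularity and only acting when it crosses a chunk boundary;
something like this (hypothetical, just roughly mirroring what the
local priority cache already does):

    #include <cstdint>
    #include <iostream>

    // Sketch of "chunky" target updates: round the ideal target to a
    // coarse chunk and only move when we land in a different chunk, so
    // small wiggles in relative aggressiveness don't cause constant
    // cache resizing.
    uint64_t chunk_align(uint64_t ideal, uint64_t chunk = 128ull << 20) {
      return (ideal / chunk) * chunk;
    }

    bool maybe_update(uint64_t& current, uint64_t ideal,
                      uint64_t chunk = 128ull << 20) {
      uint64_t wanted = chunk_align(ideal, chunk);
      if (wanted == current)
        return false;           // within the same chunk: do nothing
      current = wanted;
      return true;              // crossed a chunk boundary: resize caches
    }

    int main() {
      uint64_t target = 4096ull << 20;                 // 4 GiB
      std::cout << maybe_update(target, 4100ull << 20) // tiny wiggle: no-op
                << " " << (target >> 20) << " MiB\n";
      std::cout << maybe_update(target, 3500ull << 20) // real shift: update
                << " " << (target >> 20) << " MiB\n";
    }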
None of this guarantees we always stay below the aggregate target
(that's much harder given how our daemons work), but we could stay
under it with very high probability and hopefully with only minor
spikes relative to the aggregate total.
Mark
Regards,
Lars