Re: cephadm autotuning ceph-osd memory

Josh Durgin <jdurgin@xxxxxxxxxx> · Fri, 12 Feb 2021 09:00:41 -0800

Containers don't have to have a memory cgroup limit. It may be helpful
to avoid that with cephadm, perhaps using a strict memory limit in
testing but not in production to avoid potential availability problems.

oomkilling is a bit drastic.

Josh

On 2/12/21 8:56 AM, Mark Nelson wrote:
I suspect mon cache is not super important until something terrible 
happens and you've got an insane amount of rocksdb traffic where you 
really need the rocksdb block cache to avoid a ton of disk reads. I 
recall we've hit situations like that in the past.  That might be a case 
where we would want the mon to be able to request "burst" memory at high 
priority for the block cache and lower the memory of any other daemons 
on the same node.  As we talked about in the standup this morning this 
is difficult to do in the container world right now, but we could move 
toward a design that lets us do this on bare metal and potentially do it 
in containers if there's ever support for changing the container memory 
limits on the fly.  Already we can change things like the 
osd_memory_target.  It wouldn't be a stretch to let daemons float above 
or below that target based on relative priorities of requests in the 
priority cache manager imho.  It's the just the container part that gets 
tricky.

Mark

On 2/12/21 10:05 AM, Josh Durgin wrote:
I left some notes on the pad - I think we'd want to treat mons a bit
differently from other daemons, since more memory there isn't so
important for performance, and perhaps scales with cluster size / osdmap
churn rather than working set.

Josh

On 2/12/21 7:47 AM, Sage Weil wrote:
I'm not sure if it will be in the initial pacific release but it will
certainly get backported as soon as possible.

On Fri, Feb 12, 2021 at 8:51 AM John Fulton <johfulto@xxxxxxxxxx> wrote:

On Thu, Feb 11, 2021 at 7:47 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

On 2/11/21 5:11 PM, Sage Weil wrote:
Hi everyone,

One of the things that ceph-ansible does that cephadm does not is
automatically adjust the osd_memory_target based on the amount of
memory and number of OSDs on the host.  The approach is pretty 
simple:
if the not_hci flag is set, then we take some factor (.7) * total
memory and divide by osd count, and set osd_memory_target 
accordingly.

We'd like to bring this "hands off" behavior over to cephadm in some
form.. ideally one that is a bit more sophisticated.  The currently
proposal in written up in this pad:

https://pad.ceph.com/p/autotune_memory_target

Thank you. This would be very useful for the TripleO project's
integration of cephadm [1] to deploy colocated OpenStack/Ceph.

I understand you're still scoping the feature but can you say at this
point if you think it would be available in Pacific?

Thanks,
   John

[1] 
https://specs.openstack.org/openstack/tripleo-specs/specs/wallaby/tripleo-ceph.html 

The key ideas:
- An autotune_memory_target config option (boolean) can be enabled 
for
a specific daemon, class, host, or the whole cluster
- It will apply not just to OSDs but other daemons. Currently mon and
osd have a memory_target config; MDS has a similar setting for its
cache that we can work with.  The other daemons don't have
configurable memory footprints (for the most part) but we can come up
with some estimate of their *usage* based on some conservative
estimates, maybe with cluster size factored in.
- The autotuner would look at other (ceph) daemons on the host that
aren't being tuned and deduct that memory from the total available,
and then divvy up the remaining memory among the autotuned daemons.
Initially we'd only support autotuning osd, mon, and mds.

There are some open questions about how to implement some of the
details, though.

For instance, the current proposal doesn't record what the "tuned"
memory is in ceph anywhere, but rather would look at what 'cephadm 
ls'
says about the deployed containers' limits, compare that to what it
thinks should happen, and redeploy daemons as needed.

It might also make sense to allow these targets to be expressed via
the 'service spec' yaml, which is more or less equivalent to the CRs
in Rook (which currently do allow osd memory target to be set).

I'm not entirely sure if this should be an orchestrator function
(i.e., work with both rook and cephadm) or cephadm-only. It would be
helpful to hear whether this sort of function would make any sense in
a rook cluster... are there cases where the kubernetes nodes are
dedicated to ceph and it makes sense to magically slurp up all of the
hosts' RAM?  We probably don't want to be in the business of checking
with kubernetes about unused host resources and adjusting things
whenever the k8s scheduler does something on the host...

sage

One question is what you do when you want to add more daemons to a 
node
after the targets have already been set.  An idea I've kicked 
around for
bare-metal clusters is to let the daemons float their memory target
between lower and upper bounds based on how close to a global memory
target all daemons on a particular node are in aggregate. Each daemon
would be constantly re-evaluating it's local target like we already do
when determining the aggregate cache budget based on that memory
target.  One really neat thing you could with this model is set the
aggressiveness of each daemon based on the priorities and sizes of
unfulfilled allocation requests in the priority cache manager.  The 
idea
being that daemons that really aggressively want memory at high
priorities can float their targets up and daemons that don't really 
care
much can float their targets down.  Ultimately all daemons would still
be governed based on that global target though.  (ie if we get 
close to
or over that global target, even aggressive daemons would necessarily
back off and reduce their targets).    I suspect this can be done on
bare metal without strictly requiring coordination in the mgr 
(though we
could certainly do it that way too).

I agree for containers.  It's...different.  I don't know if we can 
even
really adjust container resource limits on the fly yet. We could 
set the
container limits really high (which is basically what we already 
do, the
memory_target is already way lower than the container limit).  I don't
know that we want to be trying to coordinate memory usage behind the
containers back like what I proposed above.  Ideally there would be 
some
kind of rich api where something inside the container can negotiate 
for
more (or less) memory for the container as a whole.  I don't think
anything like this exists though.

Mark

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx