Re: cephadm autotuning ceph-osd memory

I'm not sure if it will be in the initial pacific release but it will
certainly get backported as soon as possible.

On Fri, Feb 12, 2021 at 8:51 AM John Fulton <johfulto@xxxxxxxxxx> wrote:
>
> On Thu, Feb 11, 2021 at 7:47 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >
> >
> > On 2/11/21 5:11 PM, Sage Weil wrote:
> > > Hi everyone,
> > >
> > > One of the things that ceph-ansible does that cephadm does not is
> > > automatically adjust the osd_memory_target based on the amount of
> > > memory and number of OSDs on the host.  The approach is pretty simple:
> > > if the not_hci flag is set, then we take some factor (.7) * total
> > > memory and divide by osd count, and set osd_memory_target accordingly.
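
For reference, a minimal sketch of the calculation described above (the 0.7
factor comes from the mail; the 4 GiB floor is an assumed safety minimum, not
something taken from ceph-ansible itself):

    # Rough sketch of the ceph-ansible-style calculation: reserve a
    # fraction of host memory for Ceph and split it evenly across OSDs.
    def osd_memory_target(total_mem_bytes, num_osds, factor=0.7,
                          floor=4 * 1024**3):
        if num_osds == 0:
            return None
        target = int(total_mem_bytes * factor / num_osds)
        return max(target, floor)   # never go below the assumed floor

    # Example: a 256 GiB host with 12 OSDs -> roughly 14.9 GiB per OSD.
    print(osd_memory_target(256 * 1024**3, 12))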
> > >
> > > We'd like to bring this "hands off" behavior over to cephadm in some
> > > form... ideally one that is a bit more sophisticated.  The current
> > > proposal is written up in this pad:
> > >
> > > https://pad.ceph.com/p/autotune_memory_target
>
> Thank you. This would be very useful for the TripleO project's
> integration of cephadm [1] to deploy colocated OpenStack/Ceph.
>
> I understand you're still scoping the feature but can you say at this
> point if you think it would be available in Pacific?
>
> Thanks,
>   John
>
> [1] https://specs.openstack.org/openstack/tripleo-specs/specs/wallaby/tripleo-ceph.html
>
>
> > > The key ideas:
> > > - An autotune_memory_target config option (boolean) can be enabled for
> > > a specific daemon, class, host, or the whole cluster
> > > - It will apply not just to OSDs but also to other daemons. Currently mon and
> > > osd have a memory_target config; MDS has a similar setting for its
> > > cache that we can work with.  The other daemons don't have
> > > configurable memory footprints (for the most part) but we can come up
> > > with some estimate of their *usage* based on some conservative
> > > estimates, maybe with cluster size factored in.
> > > - The autotuner would look at other (ceph) daemons on the host that
> > > aren't being tuned and deduct that memory from the total available,
> > > and then divvy up the remaining memory among the autotuned daemons.
> > > Initially we'd only support autotuning osd, mon, and mds.
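
A rough sketch of the "deduct, then divvy up" step from that last point.  The
autotuned set (osd, mon, mds) comes from the mail; the static estimates for
untuned daemons are illustrative placeholders, not proposed values:

    # Hypothetical divvy-up: subtract estimated usage of ceph daemons we
    # are not autotuning, then split the remainder evenly among the
    # autotuned ones on the host.
    UNTUNED_ESTIMATES = {              # conservative guesses, in bytes
        'mgr': 4 * 1024**3,
        'crash': 128 * 1024**2,
        'rgw': 1 * 1024**3,
    }
    AUTOTUNED = {'osd', 'mon', 'mds'}

    def per_daemon_share(total_mem, daemons):
        """daemons: daemon types on this host, e.g. ['osd', 'osd', 'mon']."""
        reserved = sum(UNTUNED_ESTIMATES.get(d, 512 * 1024**2)
                       for d in daemons if d not in AUTOTUNED)
        tunable = [d for d in daemons if d in AUTOTUNED]
        if not tunable:
            return 0
        # An equal split; a real autotuner might weight OSDs more heavily
        # than mons or mds.
        return max(int((total_mem - reserved) / len(tunable)), 0)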
> > >
> > > There are some open questions about how to implement some of the
> > > details, though.
> > >
> > > For instance, the current proposal doesn't record what the "tuned"
> > > memory is in ceph anywhere, but rather would look at what 'cephadm ls'
> > > says about the deployed containers' limits, compare that to what it
> > > thinks should happen, and redeploy daemons as needed.
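
In other words, something like a reconcile loop.  A hypothetical sketch
('cephadm ls' does emit JSON, but the 'memory_limit' field and the
redeploy_daemon() helper used here are assumptions, not the real cephadm
data model):

    import json
    import subprocess

    def redeploy_daemon(name, memory_limit):
        # Placeholder for whatever call actually redeploys the daemon
        # with a new container memory limit.
        print(f'would redeploy {name} with limit {memory_limit}')

    def reconcile(desired_targets, tolerance=0.05):
        # Compare the limit each deployed container actually has with the
        # limit the autotuner thinks it should have; redeploy on mismatch.
        deployed = json.loads(subprocess.check_output(['cephadm', 'ls']))
        for d in deployed:
            want = desired_targets.get(d['name'])
            have = d.get('memory_limit')        # assumed field name
            if want is None or have is None:
                continue
            if abs(have - want) / want > tolerance:
                redeploy_daemon(d['name'], memory_limit=want)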
> > >
> > > It might also make sense to allow these targets to be expressed via
> > > the 'service spec' yaml, which is more or less equivalent to the CRs
> > > in Rook (which currently do allow osd memory target to be set).
> > >
> > > I'm not entirely sure if this should be an orchestrator function
> > > (i.e., work with both rook and cephadm) or cephadm-only.  It would be
> > > helpful to hear whether this sort of function would make any sense in
> > > a rook cluster... are there cases where the kubernetes nodes are
> > > dedicated to ceph and it makes sense to magically slurp up all of the
> > > hosts' RAM?  We probably don't want to be in the business of checking
> > > with kubernetes about unused host resources and adjusting things
> > > whenever the k8s scheduler does something on the host...
> > >
> > > sage
> >
> >
> > One question is what you do when you want to add more daemons to a node
> > after the targets have already been set.  An idea I've kicked around for
> > bare-metal clusters is to let the daemons float their memory target
> > between lower and upper bounds based on how close to a global memory
> > target all daemons on a particular node are in aggregate.  Each daemon
> > would be constantly re-evaluating its local target like we already do
> > when determining the aggregate cache budget based on that memory
> > target.  One really neat thing you could do with this model is to set the
> > aggressiveness of each daemon based on the priorities and sizes of
> > unfulfilled allocation requests in the priority cache manager.  The idea
> > being that daemons that really aggressively want memory at high
> > priorities can float their targets up and daemons that don't really care
> > much can float their targets down.  Ultimately all daemons would still
> > be governed based on that global target though (i.e., if we get close to
> > or over that global target, even aggressive daemons would necessarily
> > back off and reduce their targets).  I suspect this can be done on
> > bare metal without strictly requiring coordination in the mgr (though we
> > could certainly do it that way too).
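
A toy model of that floating behavior (entirely hypothetical: the bounds, the
step size, and the want_score standing in for the priority cache manager's
unfulfilled high-priority requests are all illustrative stand-ins):

    def adjust_target(current, lower, upper, host_aggregate, global_budget,
                      want_score, step=0.02):
        # How much of the host-wide budget is still unused by all daemons.
        headroom = 1.0 - host_aggregate / global_budget
        if headroom <= 0:
            # At or over the global budget: everyone backs off.
            new = current * (1 - step)
        elif want_score > 0:
            # Budget left and this daemon is hungry at high priority:
            # float the target up, scaled by the remaining headroom.
            new = current * (1 + step * min(headroom * 10, 1.0))
        else:
            # Not hungry: drift back down and release memory to others.
            new = current * (1 - step / 2)
        # Each daemon still stays between its own lower and upper bounds.
        return int(min(max(new, lower), upper))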
> >
> > I agree for containers.  It's...different.  I don't know if we can even
> > really adjust container resource limits on the fly yet. We could set the
> > container limits really high (which is basically what we already do, the
> > memory_target is already way lower than the container limit).  I don't
> > know that we want to be trying to coordinate memory usage behind the
> > containers' backs like what I proposed above.  Ideally there would be some
> > kind of rich api where something inside the container can negotiate for
> > more (or less) memory for the container as a whole.  I don't think
> > anything like this exists though.
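
For context on the limit-vs-target gap mentioned above, a tiny illustrative
helper (the 1.5x headroom ratio is an assumed value, not anything cephadm
actually uses):

    def container_limit_for(memory_target, headroom=1.5):
        # Keep the hard container limit well above osd_memory_target so
        # the daemon's own tuning, not the OOM killer, governs memory use.
        return int(memory_target * headroom)

    # e.g. a 4 GiB osd_memory_target -> a 6 GiB container memory limit
    print(container_limit_for(4 * 1024**3))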
> >
> >
> > Mark
> >
> >
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


