Re: cephadm autotuning ceph-osd memory

On Thu, Feb 11, 2021 at 7:47 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
>
> On 2/11/21 5:11 PM, Sage Weil wrote:
> > Hi everyone,
> >
> > One of the things that ceph-ansible does that cephadm does not is
> > automatically adjust the osd_memory_target based on the amount of
> > memory and number of OSDs on the host.  The approach is pretty simple:
> > if the not_hci flag is set, then we take some factor (.7) * total
> > memory and divide by osd count, and set osd_memory_target accordingly.
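
For reference, that heuristic boils down to something like this (a
rough sketch in Python, not the actual ceph-ansible code; the names
are mine):

    def osd_memory_target(total_mem_bytes, osd_count, factor=0.7):
        # Reserve (1 - factor) of host RAM for everything else and
        # split the remainder evenly across the OSDs on the host.
        return int(total_mem_bytes * factor / osd_count)

    # e.g. a 64 GiB host with 8 OSDs: 64 GiB * 0.7 / 8 ~= 5.6 GiB/OSD
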
> >
> > We'd like to bring this "hands off" behavior over to cephadm in some
> > form... ideally one that is a bit more sophisticated.  The current
> > proposal is written up in this pad:
> >
> > https://pad.ceph.com/p/autotune_memory_target

Thank you. This would be very useful for the TripleO project's
integration of cephadm [1] to deploy colocated OpenStack/Ceph.

I understand you're still scoping the feature, but can you say at this
point if you think it would be available in Pacific?

Thanks,
  John

[1] https://specs.openstack.org/openstack/tripleo-specs/specs/wallaby/tripleo-ceph.html


> > The key ideas:
> > - An autotune_memory_target config option (boolean) can be enabled for
> > a specific daemon, class, host, or the whole cluster
> > - It will apply not just to OSDs but to other daemons.  Currently mon
> > and osd have a memory_target config; MDS has a similar setting for
> > its cache that we can work with.  The other daemons don't have
> > configurable memory footprints (for the most part), but we can come
> > up with an estimate of their *usage* based on conservative
> > assumptions, maybe with cluster size factored in.
> > - The autotuner would look at other (ceph) daemons on the host that
> > aren't being tuned and deduct that memory from the total available,
> > and then divvy up the remaining memory among the autotuned daemons.
> > Initially we'd only support autotuning osd, mon, and mds.
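
If I'm reading the proposal right, the divvy-up step would amount to
something like this (my sketch; the per-daemon usage estimates and
names below are made up for illustration, they're not from the pad):

    from collections import namedtuple

    Daemon = namedtuple('Daemon', ['name', 'daemon_type'])

    AUTOTUNED = {'osd', 'mon', 'mds'}

    # Made-up conservative usage estimates for non-autotuned daemons.
    USAGE_ESTIMATE = {
        'mgr': 4 << 30,              # 4 GiB
        'crash': 128 << 20,          # 128 MiB
        'node-exporter': 128 << 20,  # 128 MiB
    }

    def divvy_up(total_mem, daemons):
        # Deduct estimated usage of daemons we aren't tuning...
        budget = total_mem
        for d in daemons:
            if d.daemon_type not in AUTOTUNED:
                budget -= USAGE_ESTIMATE.get(d.daemon_type, 1 << 30)
        tuned = [d for d in daemons if d.daemon_type in AUTOTUNED]
        if not tuned:
            return {}
        # ...and split what's left evenly among the tuned ones.
        return {d.name: budget // len(tuned) for d in tuned}
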
> >
> > There are some open questions about how to implement some of the
> > details, though.
> >
> > For instance, the current proposal doesn't record what the "tuned"
> > memory is in ceph anywhere, but rather would look at what 'cephadm ls'
> > says about the deployed containers' limits, compare that to what it
> > thinks should happen, and redeploy daemons as needed.
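
So the reconciliation loop would be something along these lines? (A
sketch only: I'm assuming 'cephadm ls' exposes the deployed limit in a
field like 'memory_limit', which I haven't verified.)

    import json
    import subprocess

    def daemons_to_redeploy(desired_limits):
        # desired_limits: daemon name -> intended memory limit (bytes)
        out = subprocess.check_output(['cephadm', 'ls'], text=True)
        return [d['name'] for d in json.loads(out)
                if d['name'] in desired_limits
                and d.get('memory_limit') != desired_limits[d['name']]]
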
> >
> > It might also make sense to allow these targets to be expressed via
> > the 'service spec' yaml, which is more or less equivalent to the CRs
> > in Rook (which currently do allow osd memory target to be set).
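
For the service spec route, I'd picture something like the following
(hypothetical syntax; the memory_target field below is not part of the
current spec schema):

    service_type: osd
    service_id: default_drive_group
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        all: true
      memory_target: 4G    # hypothetical field
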
> >
> > I'm not entirely sure if this should be an orchestrator function
> > (i.e., work with both rook and cephadm) or cephadm-only.  It would be
> > helpful to hear whether this sort of function would make any sense in
> > a rook cluster... are there cases where the kubernetes nodes are
> > dedicated to ceph and it makes sense to magically slurp up all of the
> > hosts' RAM?  We probably don't want to be in the business of checking
> > with kubernetes about unused host resources and adjusting things
> > whenever the k8s scheduler does something on the host...
> >
> > sage
>
>
> One question is what you do when you want to add more daemons to a node
> after the targets have already been set.  An idea I've kicked around for
> bare-metal clusters is to let the daemons float their memory target
> between lower and upper bounds based on how close to a global memory
> target all daemons on a particular node are in aggregate.  Each daemon
> would be constantly re-evaluating its local target, like we already do
> when determining the aggregate cache budget based on that memory
> target.  One really neat thing you could do with this model is set the
> aggressiveness of each daemon based on the priorities and sizes of
> unfulfilled allocation requests in the priority cache manager.  The idea
> being that daemons that really aggressively want memory at high
> priorities can float their targets up and daemons that don't really care
> much can float their targets down.  Ultimately all daemons would still
> be governed by that global target, though (i.e., if we get close to
> or over that global target, even aggressive daemons would necessarily
> back off and reduce their targets).  I suspect this can be done on
> bare metal without strictly requiring coordination in the mgr (though we
> could certainly do it that way too).
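
If I follow, each daemon's adjustment loop would look very roughly like
this (my sketch; the step size and the 'want' signal are stand-ins for
whatever the priority cache manager actually reports):

    def adjust_target(current, lower, upper, want,
                      node_usage, node_target, step=0.05):
        # want in [-1, 1]: how badly this daemon wants more memory,
        # e.g. derived from unfulfilled high-priority cache requests.
        if node_usage >= node_target:
            # At or over the node-wide budget: everyone backs off.
            new = current * (1 - step)
        else:
            # Headroom left: hungry daemons float up, others drift down.
            new = current * (1 + step * want)
        return int(min(upper, max(lower, new)))
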
>
> I agree for containers.  It's...different.  I don't know if we can even
> really adjust container resource limits on the fly yet. We could set the
> container limits really high (which is basically what we already do, the
> memory_target is already way lower than the container limit).  I don't
> know that we want to be trying to coordinate memory usage behind the
> container's back like what I proposed above.  Ideally there would be some
> kind of rich api where something inside the container can negotiate for
> more (or less) memory for the container as a whole.  I don't think
> anything like this exists though.
>
>
> Mark
>
>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


