On Thu, Feb 11, 2021 at 7:47 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> On 2/11/21 5:11 PM, Sage Weil wrote:
> > Hi everyone,
> >
> > One of the things that ceph-ansible does that cephadm does not is
> > automatically adjust the osd_memory_target based on the amount of
> > memory and the number of OSDs on the host. The approach is pretty
> > simple: if the not_hci flag is set, we take some factor (.7) * total
> > memory, divide by the OSD count, and set osd_memory_target
> > accordingly.
> >
> > We'd like to bring this "hands off" behavior over to cephadm in some
> > form... ideally one that is a bit more sophisticated. The current
> > proposal is written up in this pad:
> >
> > https://pad.ceph.com/p/autotune_memory_target

Thank you. This would be very useful for the TripleO project's integration
of cephadm [1] to deploy colocated OpenStack/Ceph. I understand you're
still scoping the feature, but can you say at this point whether you think
it will be available in Pacific?

Thanks,
John

[1] https://specs.openstack.org/openstack/tripleo-specs/specs/wallaby/tripleo-ceph.html

> > The key ideas:
> > - An autotune_memory_target config option (boolean) can be enabled
> >   for a specific daemon, class, host, or the whole cluster.
> > - It will apply not just to OSDs but to other daemons. Currently mon
> >   and osd have a memory_target config; the MDS has a similar setting
> >   for its cache that we can work with. The other daemons don't have
> >   configurable memory footprints (for the most part), but we can come
> >   up with some estimate of their *usage* based on conservative
> >   estimates, maybe with cluster size factored in.
> > - The autotuner would look at the other (ceph) daemons on the host
> >   that aren't being tuned, deduct their memory from the total
> >   available, and then divvy up the remaining memory among the
> >   autotuned daemons. Initially we'd only support autotuning osd,
> >   mon, and mds.
> >
> > There are some open questions about how to implement some of the
> > details, though.
> >
> > For instance, the current proposal doesn't record the "tuned" memory
> > anywhere in ceph; rather, it would look at what 'cephadm ls' says
> > about the deployed containers' limits, compare that to what it thinks
> > should happen, and redeploy daemons as needed.
> >
> > It might also make sense to allow these targets to be expressed via
> > the 'service spec' yaml, which is more or less equivalent to the CRs
> > in Rook (which currently do allow the osd memory target to be set).
> >
> > I'm not entirely sure whether this should be an orchestrator function
> > (i.e., work with both rook and cephadm) or cephadm-only. It would be
> > helpful to hear whether this sort of function would make any sense in
> > a rook cluster... are there cases where the kubernetes nodes are
> > dedicated to ceph and it makes sense to magically slurp up all of the
> > hosts' RAM? We probably don't want to be in the business of checking
> > with kubernetes about unused host resources and adjusting things
> > whenever the k8s scheduler does something on the host...
> >
> > sage
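For illustration, the divvy-up logic Sage describes above might look
roughly like the sketch below. This is not the actual cephadm
implementation; the function name, the per-daemon usage estimates, and the
reuse of ceph-ansible's 0.7 factor are all assumptions made for the
example.

# Rough sketch only -- not the actual cephadm autotuner.  Names, estimates,
# and the 0.7 factor (borrowed from ceph-ansible) are assumptions.

AUTOTUNE_FACTOR = 0.7

# conservative placeholder usage estimates for daemons we do *not* autotune
UNTUNED_ESTIMATES = {
    'mgr': 4 * 1024**3,
    'rgw': 1 * 1024**3,
    'crash': 128 * 1024**2,
}

def divvy_up(total_mem, daemons):
    """Return {daemon_name: memory_target} for the autotuned daemons.

    `daemons` is a list of (name, daemon_type) tuples for every ceph
    daemon on the host.  Only osd, mon, and mds are autotuned initially.
    """
    tuned = [(n, t) for n, t in daemons if t in ('osd', 'mon', 'mds')]
    untuned = [t for _, t in daemons if t not in ('osd', 'mon', 'mds')]

    # deduct an estimate for every daemon we aren't tuning
    available = total_mem - sum(UNTUNED_ESTIMATES.get(t, 1024**3)
                                for t in untuned)
    available = max(available, 0)

    if not tuned:
        return {}
    # split the remainder (scaled by the safety factor) evenly
    per_daemon = int(available * AUTOTUNE_FACTOR / len(tuned))
    return {name: per_daemon for name, _ in tuned}

In this sketch, untuned daemons are charged a flat conservative estimate
and whatever is left (scaled by the safety factor) is split evenly across
the autotuned daemons; a real implementation could just as well weight the
split by daemon type rather than dividing evenly.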
> One question is what you do when you want to add more daemons to a node
> after the targets have already been set. An idea I've kicked around for
> bare-metal clusters is to let the daemons float their memory target
> between lower and upper bounds based on how close all of the daemons on
> a particular node are, in aggregate, to a global memory target. Each
> daemon would constantly re-evaluate its local target, like we already do
> when determining the aggregate cache budget based on that memory target.
> One really neat thing you could do with this model is set the
> aggressiveness of each daemon based on the priorities and sizes of
> unfulfilled allocation requests in the priority cache manager. The idea
> is that daemons that really aggressively want memory at high priorities
> can float their targets up, and daemons that don't care much can float
> their targets down. Ultimately all daemons would still be governed by
> that global target, though (i.e., if we get close to or over the global
> target, even aggressive daemons would have to back off and reduce their
> targets). I suspect this can be done on bare metal without strictly
> requiring coordination in the mgr (though we could certainly do it that
> way too).
>
> I agree, for containers it's... different. I don't know if we can even
> really adjust container resource limits on the fly yet. We could set the
> container limits really high (which is basically what we already do; the
> memory_target is already way lower than the container limit). I don't
> know that we want to be trying to coordinate memory usage behind the
> containers' backs like what I proposed above. Ideally there would be
> some kind of rich API where something inside the container can negotiate
> for more (or less) memory for the container as a whole. I don't think
> anything like that exists, though.
>
> Mark
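As a rough, purely hypothetical sketch of the floating-target idea Mark
describes (none of these names or heuristics exist in Ceph today), each
daemon could periodically recompute its own target from its bounds, its
appetite for memory, and how much headroom the node as a whole has left:

# Purely hypothetical sketch of a per-daemon "floating" memory target.

def float_target(lower, upper, aggressiveness,
                 node_total_usage, node_global_target):
    """Pick a memory target between `lower` and `upper`.

    `aggressiveness` (0..1) reflects how badly this daemon wants memory,
    e.g. derived from the size and priority of unfulfilled requests in
    the priority cache manager.  As aggregate usage on the node nears
    the global target, every daemon is squeezed back toward `lower`.
    """
    # headroom: 1.0 when the node is idle, 0.0 at/over the global target
    headroom = max(0.0, 1.0 - node_total_usage / node_global_target)

    # float above `lower` in proportion to both this daemon's appetite
    # and how much room the node as a whole still has
    return int(lower + (upper - lower) * aggressiveness * headroom)

For example, a daemon with aggressiveness 0.9 and bounds of 2-8 GiB on a
node using 64 GiB of a 96 GiB global target would float to roughly
3.8 GiB; as aggregate usage approaches the global target, the headroom
term drives even aggressive daemons back toward their lower bound.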