(Semi re-send to pass the vger plain-text requirement, sorry for duplication)

On Thu, Mar 7, 2019 at 2:25 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Thu, 7 Mar 2019, Patrick Donnelly wrote:
> > On Thu, Mar 7, 2019 at 11:16 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 7, 2019 at 7:27 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, 7 Mar 2019, Sage Weil wrote:
> > > > > On Thu, 7 Mar 2019, Sebastien Han wrote:
> > > > > > Let me take back something, afterthoughts, yes "memory.request" is
> > > > > > what we are interested in, the only thing is that there is no
> > > > > > /sys/fs/cgroup/memory equivalent.
> > > > >
> > > > > Okay, if that's the case, then we have to take the limit from the
> > > > > kubernetes-specific environment variable. The downside is it's
> > > > > kube-specific, but so be it.
> > > > >
> > > > > > If we take the OSD example, we could set REQUEST to 1GB and LIMIT to
> > > > > > 2GB (making up numbers again) which leaves 1GB for backfilling and
> > > > > > other stuff.
> > > > >
> > > > > Just a side-note here: there is no longer a recovery vs non-recovery
> > > > > memory utilization delta. IIRC we even backported that config
> > > > > osd_pg_log_{min,max}_entries change to luminous too.
> > > > >
> > > > > > The last thing is, as part of these 4GB how much do we want to give to
> > > > > > the osd_memory_target.
> > > > >
> > > > > I think we should set osd_memory_target to POD_MEMORY_REQUEST if it's
> > > > > non-zero. If it is 0 and POD_MEMORY_LIMIT is set, I suggest .8 * that.
> > > >
> > > > Hmm, I think the place to do that is here:
> > > >
> > > > https://github.com/ceph/ceph/blob/master/src/common/config.cc#L456
> > > >
> > > > It's a bit awkward to parse the env variable into one of
> > > > {osd,mds,mon}_memory_target based on the daemon type, but it could be done
> > > > (once the others exist).
> > >
> > > Surely it’s easier to do a translation from Kubernetes configs to Ceph
> > > configs at the init system layer (whatever that is for
> > > Rook/Kubernetes). Why do we want to embed it within the Ceph C++
> > > source? If we do that we'll need to do the same thing all over again
> > > for non-Kubernetes and put it in the same place, even though the logic
> > > is likely to be completely different and involve checking state that
> > > the daemon won't have.
> >
> > Looking at the cgroup memory limit is going to be about as fundamental
> > as getrusage(3) or reading proc. It's also used by systemd (see
> > systemd.resource-control(5)).
> >
> > If the system is telling us our resource limits, we should listen!

Sure, but we're talking about looking for Kubernetes-set environment
variables, not cgroup limits.

>
> The problem here is that this is set based on POD_MEMORY_LIMIT, the point
> at which we get ENOMEM, not the target POD_MEMORY_REQUEST, which is what
> we should actually try to achieve. IIUC the latter isn't going to be
> found in any standard place in the container environment (besides the
> kubernetes-specific POD_MEMORY_REQUEST).
>
> With that caveat, it seems like we should *also* look at the cgroup limit
> * some factor (e.g., .8) and use it as a ceiling for *_memory_target...

Even that probably has too many assumptions though. If we're in a
cgroup in a non-Kubernetes/plain container context, there's every
possibility that we aren't the only consumer of that cgroup's resource
limits (for instance, all of the Ceph processes on a machine in one
cgroup, or vhost-style process isolation with all of a tenant in one
cgroup), so there would need to be another input telling us how much
memory to target for any individual process. :/
-Greg
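
For illustration only, the selection order discussed in this thread (use
POD_MEMORY_REQUEST when it is non-zero, otherwise .8 * POD_MEMORY_LIMIT,
optionally capped by .8 * a cgroup limit) could be sketched roughly as
below. This is a sketch under assumptions, not Ceph or Rook code: the
function names, the assumption that both environment variables hold plain
byte counts, and the choice to pass the cgroup limit in as an explicit
argument are all hypothetical.

```python
from typing import Mapping, Optional


def _parse_bytes(value: Optional[str]) -> int:
    """Parse a byte count; unset, malformed or negative values count as 0.

    Assumption: the variables carry plain integers (bytes), no unit suffix.
    """
    try:
        return max(int(value), 0)
    except (TypeError, ValueError):
        return 0


def pick_memory_target(
    env: Mapping[str, str],
    cgroup_limit: Optional[int] = None,
    percent: int = 80,
) -> Optional[int]:
    """Return a candidate *_memory_target in bytes, or None if nothing is set.

    Hypothetical helper following the order proposed in the thread:
      1. POD_MEMORY_REQUEST if it is non-zero,
      2. otherwise `percent` (the ".8") of POD_MEMORY_LIMIT if that is set,
      3. optionally capped by `percent` of a caller-supplied cgroup limit.
    """
    request = _parse_bytes(env.get("POD_MEMORY_REQUEST"))
    limit = _parse_bytes(env.get("POD_MEMORY_LIMIT"))

    if request > 0:
        target: Optional[int] = request
    elif limit > 0:
        # Integer arithmetic avoids float rounding on large byte counts.
        target = limit * percent // 100
    else:
        target = None

    # The cgroup value only ever lowers an existing target (a ceiling).
    if target is not None and cgroup_limit:
        target = min(target, cgroup_limit * percent // 100)
    return target


if __name__ == "__main__":
    import os

    print(pick_memory_target(os.environ))
```

The cgroup limit is an argument rather than something the helper reads on
its own because, as noted above, a process may not be the only consumer
of its cgroup; whoever calls this would have to decide what share of that
limit belongs to the individual daemon.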