On Thu, Mar 7, 2019 at 2:25 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Thu, 7 Mar 2019, Patrick Donnelly wrote:
> > On Thu, Mar 7, 2019 at 11:16 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 7, 2019 at 7:27 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, 7 Mar 2019, Sage Weil wrote:
> > > > > On Thu, 7 Mar 2019, Sebastien Han wrote:
> > > > > > Let me take back something, afterthoughts, yes "memory.request" is
> > > > > > what we are interested in, the only thing is that there is no
> > > > > > /sys/fs/cgroup/memory equivalent.
> > > > >
> > > > > Okay, if that's the case, then we have to take the limit from the
> > > > > kubernetes-specific environment variable. The downside is it's
> > > > > kube-specific, but so be it.
> > > > >
> > > > > > If we take the OSD example, we could set REQUEST to 1GB and LIMIT to
> > > > > > 2GB (making up numbers again) which leaves 1GB for backfilling and
> > > > > > other stuff.
> > > > >
> > > > > Just a side-note here: there is no longer a recovery vs non-recovery
> > > > > memory utilization delta. IIRC we even backported that
> > > > > osd_pg_log_{min,max}_entries config change to luminous too.
> > > > >
> > > > > > The last thing is, as part of these 4GB, how much do we want to give
> > > > > > to the osd_memory_target?
> > > > >
> > > > > I think we should set osd_memory_target to POD_MEMORY_REQUEST if it's
> > > > > non-zero. If it is 0 and POD_MEMORY_LIMIT is set, I suggest .8 * that.
> > > >
> > > > Hmm, I think the place to do that is here:
> > > >
> > > > https://github.com/ceph/ceph/blob/master/src/common/config.cc#L456
> > > >
> > > > It's a bit awkward to parse the env variable into one of
> > > > {osd,mds,mon}_memory_target based on the daemon type, but it could be
> > > > done (once the others exist).
> > >
> > > Surely it's easier to do a translation from Kubernetes configs to Ceph
> > > configs at the init system layer (whatever that is for Rook/Kubernetes).
> > > Why do we want to embed it within the Ceph C++ source? If we do that,
> > > we'll need to do the same thing all over again for non-Kubernetes and
> > > put it in the same place, even though the logic is likely to be
> > > completely different and involve checking state that the daemon won't
> > > have.
> >
> > Looking at the cgroup memory limit is going to be about as fundamental
> > as getrusage(3) or reading /proc. It's also used by systemd (see
> > systemd.resource-control(5)).
> >
> > If the system is telling us our resource limits, we should listen!
>
> The problem here is that this is set based on POD_MEMORY_LIMIT, the point
> at which we get ENOMEM, not the target POD_MEMORY_REQUEST, which is what
> we should actually try to achieve. IIUC the latter isn't going to be
> found in any standard place in the container environment (besides the
> kubernetes-specific POD_MEMORY_REQUEST).
>
> With that caveat, it seems like we should *also* look at the cgroup limit
> * some factor (e.g., .8) and use it as a ceiling for *_memory_target...

After thinking about this more, and given that _REQUEST is not available
via the cgroup API, I'm wondering whether _REQUEST is really valuable. In
practice (for Rook), what's valuable about _REQUEST vs. _LIMIT? We can
still have options for setting the desired memory target, but why ask each
operator to answer the question: "well, I have X amount of RAM for this
daemon, but it probably should just use (X-Y) RAM to be safe".
Let's answer that question uniformly by adjusting config defaults based
on the hard limit.

-- 
Patrick Donnelly
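
For illustration, here is a minimal sketch of the derivation discussed
above: prefer POD_MEMORY_REQUEST if set, otherwise fall back to 0.8 *
POD_MEMORY_LIMIT. The names env_bytes and derive_memory_target are made
up for this example and are not part of Ceph's actual config code.

    // Illustrative sketch only -- not actual Ceph code. Derives a memory
    // target from the Kubernetes-injected environment variables discussed
    // in the thread above.
    #include <cstdint>
    #include <cstdlib>
    #include <exception>
    #include <iostream>
    #include <string>

    // Parse an environment variable holding a byte count; returns 0 if
    // unset or unparsable.
    static uint64_t env_bytes(const char* name) {
      const char* v = std::getenv(name);
      if (!v || !*v) {
        return 0;
      }
      try {
        return std::stoull(v);
      } catch (const std::exception&) {
        return 0;
      }
    }

    // Derive a memory target (in bytes) from the pod's request/limit.
    // Returns 0 if neither variable is set, i.e. keep the config default.
    static uint64_t derive_memory_target() {
      uint64_t request = env_bytes("POD_MEMORY_REQUEST");
      if (request > 0) {
        return request;                             // use the request directly
      }
      uint64_t limit = env_bytes("POD_MEMORY_LIMIT");
      if (limit > 0) {
        return static_cast<uint64_t>(limit * 0.8);  // leave 20% headroom
      }
      return 0;
    }

    int main() {
      uint64_t target = derive_memory_target();
      if (target) {
        std::cout << "osd_memory_target = " << target << "\n";
      } else {
        std::cout << "no pod memory hints; keeping default\n";
      }
      return 0;
    }

For example, running this with POD_MEMORY_LIMIT=2147483648 (2 GiB) and no
POD_MEMORY_REQUEST set prints osd_memory_target = 1717986918 (~1.6 GiB).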