Re: Ceph, container and memory

On Thu, Mar 7, 2019 at 3:25 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Thu, 7 Mar 2019, Patrick Donnelly wrote:
> > > The problem here is that this is set based on POD_MEMORY_LIMIT, the point
> > > at which we get ENOMEM, not the target POD_MEMORY_REQUEST, which is what
> > > we should actually try to achieve.  IIUC the latter isn't going to be
> > > found in any standard place in the container environment (besides the
> > > kubernetes-specific POD_MEMORY_REQUEST).
> > >
> > > With that caveat, it seems like we should *also* look at the cgroup limit
> > > * some factor (e.g., .8) and use it as a ceiling for *_memory_target...
> >
> > After thinking about this more and given that _REQUEST is not
> > available via the cgroup API, I'm wondering if _REQUEST is really
> > valuable or not. In practice (for Rook), what's valuable about
> > _REQUEST vs. _LIMIT?
> >
> > We can still have options for setting the desired memory target but
> > why ask each operator to answer the question: "well, I have X amount
> > of RAM for this daemon but it probably should just use (X-Y) RAM to be
> > safe". Let's answer that question uniformly by adjusting config
> > defaults based on the hard-limit.
>
> From the osd side of things, the reality is that whatever
> osd_memory_target is, our actual usage oscillates around that value by +/-
> 10%.  And in general that is probably what we want... the host will
> generally have some slack and short bursts will be okay.  And setting a
> hard limit too close to the target will mean we'll crash with ENOMEM too
> frequently.

But isn't this exactly the kind of expert knowledge we shouldn't
expect users/operators to have? If the OSD fluctuates within ~10% of
the memory target, then we should probably have it use ~85% (or 66% as
you suggest below) of the hard limit. We can always provide a config
option for that percentage in case some unexpected workload breaks the
10% buffer.
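
To make that concrete, something like the sketch below could derive the
default from the hard limit. The function names, the ratio value, and the
cgroup v1 path are illustrative assumptions on my part, not anything
already in the tree:

// Rough sketch: derive the daemon memory target from the cgroup hard
// limit using a configurable safety ratio.
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <optional>

std::optional<uint64_t> read_cgroup_hard_limit() {
  std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes"); // cgroup v1
  uint64_t v = 0;
  if (!(f >> v)) return std::nullopt;
  // Very large values mean "unlimited" on most kernels.
  if (v == 0 || v > (1ULL << 60)) return std::nullopt;
  return v;
}

// e.g. ratio = 0.85 (10% oscillation plus some slack), or 0.66 as below.
uint64_t default_memory_target(uint64_t configured, double ratio) {
  if (auto limit = read_cgroup_hard_limit()) {
    return std::min(configured, (uint64_t)(*limit * ratio));
  }
  return configured;  // no container limit detected; honor the config as-is
}

With something like that, the operator only has to set the pod limit and
the daemon's default stays comfortably under it.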

> So, independent of what the mechanism is for setting the osd target,
> I'd probably want to set osd_memory_target == POD_MEMORY_REQUEST,
> and set POD_MEMORY_LIMIT to 1.5x that as a safety stop.
>
> 1. We could get to the same place by setting
> osd_memory_target_as_cgroup_limit_ratio to .66, but that would have to be
> done in concert with whoever is setting POD_MEMORY_{LIMIT,REQUEST} so that
> the ratio matches.
>
> 2. Or, we could teach the OSD to pick up on POD_MEMORY_REQUEST and set
> osd_memory_target to that, and ignore the real limit.
>
> 3. Or, we could pick up on POD_MEMORY_REQUEST and set osd_memory_target to
> that, and then just check the hard limit and use that as a cap (to make
> sure POD_MEMORY_REQUEST isn't somehow > the hard limit, or the hard limit
> * some factor like .66 or .8).
>
> I like 2 just because it's simple.  There's no magic, and it doesn't
> invite extra complexity like monitoring /proc for changes to the cgroup
> limit at runtime.  But maybe we want that?  Does kubernetes even let you
> change these limits on running pods?

There could be a future where we want to request more RAM from the
operator when the OSD is degraded or the MDS wants to grow its cache.
Monitoring the cgroup limit would be necessary in that case.
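
Roughly what I'm picturing, as a sketch of option 3 above plus a periodic
re-check; the names, the POD_MEMORY_REQUEST parsing, and the 30s interval
are illustrative assumptions only, not an actual implementation:

// Prefer POD_MEMORY_REQUEST if set, cap it by the cgroup hard limit
// times a factor, and re-evaluate periodically so a changed limit is
// picked up at runtime.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <functional>
#include <optional>
#include <thread>

static std::optional<uint64_t> env_bytes(const char *name) {
  const char *v = std::getenv(name);
  if (!v) return std::nullopt;
  return std::strtoull(v, nullptr, 10);   // assume a plain byte count
}

static std::optional<uint64_t> cgroup_hard_limit() {
  std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes"); // cgroup v1
  uint64_t v = 0;
  if (!(f >> v) || v == 0 || v > (1ULL << 60)) return std::nullopt;
  return v;
}

// Prefer POD_MEMORY_REQUEST; never exceed hard limit * factor.
static uint64_t compute_target(uint64_t fallback, double factor = 0.8) {
  uint64_t t = env_bytes("POD_MEMORY_REQUEST").value_or(fallback);
  if (auto limit = cgroup_hard_limit())
    t = std::min(t, (uint64_t)(*limit * factor));
  return t;
}

// Re-evaluate periodically so a changed cgroup limit (or a resized pod,
// if that ever becomes possible) is applied without a restart.
void watch_memory_target(uint64_t fallback,
                         std::function<void(uint64_t)> apply) {
  for (;;) {
    apply(compute_target(fallback));
    std::this_thread::sleep_for(std::chrono::seconds(30));
  }
}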

-- 
Patrick Donnelly


