On Thu, 7 Mar 2019, Patrick Donnelly wrote:
> > The problem here is that this is set based on POD_MEMORY_LIMIT, the point
> > at which we get ENOMEM, not the target POD_MEMORY_REQUEST, which is what
> > we should actually try to achieve. IIUC the latter isn't going to be
> > found in any standard place in the container environment (besides the
> > kubernetes-specific POD_MEMORY_REQUEST).
> >
> > With that caveat, it seems like we should *also* look at the cgroup limit
> > * some factor (e.g., .8) and use it as a ceiling for *_memory_target...
>
> After thinking about this more and given that _REQUEST is not
> available via the cgroup API, I'm wondering if _REQUEST is really
> valuable or not. In practice (for Rook), what's valuable about
> _REQUEST vs. _LIMIT?
>
> We can still have options for setting the desired memory target but
> why ask each operator to answer the question: "well, I have X amount
> of RAM for this daemon but it probably should just use (X-Y) RAM to be
> safe". Let's answer that question uniformly by adjusting config
> defaults based on the hard limit.

From the osd side of things, the reality is that whatever
osd_memory_target is, our actual usage oscillates around that value by
+/- 10%. And in general that is probably what we want... the host will
generally have some slack and short bursts will be okay. And setting a
hard limit too close to the target will mean we'll crash with ENOMEM too
frequently.

So, independent of what the mechanism is for setting the osd target, I'd
probably want to set osd_memory_target == POD_MEMORY_REQUEST, and set
POD_MEMORY_LIMIT to 1.5x that as a safety stop.

1. We could get to the same place by setting
osd_memory_target_as_cgroup_limit_ratio to .66, but that would have to
be done in concert with whoever is setting POD_MEMORY_{LIMIT,REQUEST} so
that the ratio matches.

2. Or, we could teach the OSD to pick up on POD_MEMORY_REQUEST and set
osd_memory_target to that, and ignore the real limit.

3. Or, we could pick up on POD_MEMORY_REQUEST and set osd_memory_target
to that, and then just check the hard limit and use that as a cap (to
make sure POD_MEMORY_REQUEST isn't somehow > the hard limit, or the hard
limit * some factor like .66 or .8). A rough sketch of what this could
look like is below.

I like 2 just because it's simple. There's no magic, and it doesn't
invite extra complexity like monitoring /proc for changes to the cgroup
limit at runtime. But maybe we want that? Does kubernetes even let you
change these limits on running pods?

s
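
For concreteness, here's a minimal sketch of options 2/3 (not actual
Ceph code). It assumes POD_MEMORY_REQUEST is exposed to the daemon as
an environment variable in bytes, reads the cgroup v1 hard limit from
/sys/fs/cgroup/memory/memory.limit_in_bytes, and uses the .8 factor
mentioned above; the function names are just placeholders.

// Hypothetical sketch of option 3: derive osd_memory_target from
// POD_MEMORY_REQUEST, capped by the cgroup hard limit * some factor.
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <iostream>

// Read the cgroup v1 memory hard limit, if any.  Returns 0 if the file
// is missing or the limit is effectively "unlimited".
static uint64_t get_cgroup_memory_limit()
{
  std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes");
  uint64_t limit = 0;
  if (!(f >> limit)) {
    return 0;
  }
  // The kernel reports a huge sentinel value when no limit is set.
  if (limit > (1ull << 60)) {
    return 0;
  }
  return limit;
}

// Option 3: target = POD_MEMORY_REQUEST, but never more than
// cgroup_limit * ratio.  Falls back to the configured default when the
// pod environment doesn't provide a request.
static uint64_t calc_osd_memory_target(uint64_t default_target,
                                       double limit_ratio = 0.8)
{
  uint64_t target = default_target;
  if (const char *req = std::getenv("POD_MEMORY_REQUEST")) {
    target = std::strtoull(req, nullptr, 10);
  }
  if (uint64_t limit = get_cgroup_memory_limit(); limit > 0) {
    uint64_t ceiling = static_cast<uint64_t>(limit * limit_ratio);
    if (target == 0 || target > ceiling) {
      target = ceiling;
    }
  }
  return target ? target : default_target;
}

int main()
{
  // e.g. a default osd_memory_target of 4 GiB
  std::cout << calc_osd_memory_target(4ull << 30) << std::endl;
}

Option 2 is just the same thing with the cgroup-limit cap dropped.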