Re: Ceph, container and memory

On 3/7/19 5:25 PM, Sage Weil wrote:
On Thu, 7 Mar 2019, Patrick Donnelly wrote:
The problem here is that this is set based on POD_MEMORY_LIMIT, the point
at which we get ENOMEM, not the target POD_MEMORY_REQUEST, which is what
we should actually try to achieve.  IIUC the latter isn't going to be
found in any standard place in the container environment (besides the
kubernetes-specific POD_MEMORY_REQUEST).

With that caveat, it seems like we should *also* look at the cgroup limit
* some factor (e.g., .8) and use it as a ceiling for *_memory_target...
After thinking about this more and given that _REQUEST is not
available via the cgroup API, I'm wondering if _REQUEST is really
valuable or not. In practice (for Rook), what's valuable about
_REQUEST vs. _LIMIT?

We can still have options for setting the desired memory target, but
why ask each operator to answer the question: "well, I have X amount
of RAM for this daemon, but it should probably just use (X-Y) RAM to
be safe"?  Let's answer that question uniformly by adjusting config
defaults based on the hard limit.
From the osd side of things, the reality is that whatever
osd_memory_target is, our actual usage oscillates around that value by +/-
10%.  And in general that is probably what we want... the host will
generally have some slack and short bursts will be okay.  And setting a
hard limit too close to the target will mean we'll crash with ENOMEM too
frequently.

So, independent of what the mechanism is for setting the osd target,
I'd probably want to set osd_memory_target == POD_MEMORY_REQUEST,
and set POD_MEMORY_LIMIT to 1.5x that as a safety stop.

1. We could get to the same place by setting
osd_memory_target_as_cgroup_limit_ratio to .66, but that would have to be
done in concert with whoever is setting POD_MEMORY_{LIMIT,REQUEST} so that
the ratio matches.

2. Or, we could teach the OSD to pick up on POD_MEMORY_REQUEST and set
osd_memory_target to that, and ignore the real limit.

3. Or, we could pick up on POD_MEMORY_REQUEST and set osd_memory_target to
that, and then just check the hard limit and use that as a cap (to make
sure POD_MEMORY_REQUEST isn't somehow > the hard limit, or the hard limit
* some factor like .66 or .8).
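
As a rough sketch of option 3 as pure policy (the 0.8 factor and the
example numbers in main() are illustrative only):

#include <algorithm>
#include <cstdint>
#include <iostream>

// Prefer POD_MEMORY_REQUEST as the target, but never let it exceed the cgroup
// hard limit * a safety factor (0.8 here, purely illustrative).
uint64_t pick_memory_target(uint64_t pod_memory_request,  // 0 if unset
                            uint64_t cgroup_hard_limit,   // 0 if unlimited
                            uint64_t configured_default)
{
  uint64_t target = pod_memory_request ? pod_memory_request
                                       : configured_default;
  if (cgroup_hard_limit > 0) {
    // Guard against a request that is somehow above the hard limit.
    target = std::min<uint64_t>(target,
                                uint64_t(cgroup_hard_limit * 0.8));
  }
  return target;
}

int main()
{
  const uint64_t gib = 1ull << 30;
  // Request of 4 GiB with the limit at 1.5x that (6 GiB): the request is
  // below 0.8 * limit, so the target ends up at the request itself.
  std::cout << pick_memory_target(4 * gib, 6 * gib, 4 * gib) / gib
            << " GiB\n";
  return 0;
}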

I like 2 just because it's simple.  There's no magic, and it doesn't
invite extra complexity like monitoring /proc for changes to the cgroup
limit at runtime.  But maybe we want that?  Does kubernetes even let you
change these limits on running pods?

s




FWIW, I don't think it really matters that much.  In some fashion we'll communicate to the OSD that the memory target should be set to some fraction of the container size limit.  There are lots of ways we can do it.  In other scenarios where resources are shared (bare metal, whatever) we'll want to let the user decide.  I vote for whatever solution is the easiest to support with the fewest conditional corner cases.  I will note that in the new age-binning branch we assign memory at different priority levels.  We might be able to use POD_MEMORY_REQUEST in some fashion.
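
Purely as a sketch of how that might look (not the branch's actual code; the bin names and the split-the-headroom policy are made up): give the higher-priority bins first claim on memory up to POD_MEMORY_REQUEST, and let lower-priority data use only part of the headroom between the request and the hard limit.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct Bin {
  const char *name;
  uint64_t wanted;    // how much this priority level would like to cache
  uint64_t assigned;  // what it actually gets
};

// bins are ordered from highest to lowest priority.
void assign_memory(std::vector<Bin> &bins,
                   uint64_t request,      // POD_MEMORY_REQUEST
                   uint64_t hard_limit)   // POD_MEMORY_LIMIT
{
  // First pass: hand out the request, highest priority first.
  uint64_t budget = request;
  for (auto &b : bins) {
    b.assigned = std::min(b.wanted, budget);
    budget -= b.assigned;
  }
  // Second pass: let unmet demand spill into only part of the headroom
  // between the request and the hard limit, so bursts stay clear of ENOMEM.
  uint64_t headroom = (hard_limit > request) ? (hard_limit - request) / 2 : 0;
  for (auto &b : bins) {
    uint64_t extra = std::min(b.wanted - b.assigned, headroom);
    b.assigned += extra;
    headroom -= extra;
  }
}

int main()
{
  const uint64_t gib = 1ull << 30;
  std::vector<Bin> bins = {{"onode", 2 * gib, 0},
                           {"buffer", 2 * gib, 0},
                           {"rocksdb", 2 * gib, 0}};
  assign_memory(bins, 4 * gib, 6 * gib);
  for (const auto &b : bins)
    std::cout << b.name << ": " << b.assigned / double(gib) << " GiB\n";
  return 0;
}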


I'm more concerned about the memory oscillation and the potential for hitting OOM.  When you've got a bunch of OSDs running on bare metal, that oscillation isn't quite as scary, since each OSD can hit its peak at a different point in time (though backfill/recovery or other global events might prove that a bad assumption).  So far, switching the bluestore caches to trim-on-write instead of trimming periodically in the mempool thread seems to help (especially with the new age-binning code).  I suspect that on fast storage, trim-on-write is avoiding some of that oscillation.  You can see my first attempt at it as part of my working age-binning branch here:


https://github.com/markhpc/ceph/commit/a7dd6890ca89f1c60f3e282e8b4f44f213a90fac
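
The idea, roughly (an illustration only, not the code in that branch): check and trim on the insert path instead of leaving it entirely to a periodic thread, so usage can't run far past the target between wakeups.

#include <cstdint>
#include <list>
#include <mutex>
#include <utility>

class LruCache {
public:
  explicit LruCache(uint64_t target_bytes) : target(target_bytes) {}

  void insert(uint64_t key, uint64_t size)
  {
    std::lock_guard<std::mutex> l(lock);
    items.emplace_front(key, size);
    used += size;
    trim_locked();   // trim-on-write: bound memory right on the insert path
  }

  // A periodic thread can still call this, but it's no longer the only thing
  // keeping 'used' near 'target'.
  void trim()
  {
    std::lock_guard<std::mutex> l(lock);
    trim_locked();
  }

private:
  void trim_locked()
  {
    while (used > target && !items.empty()) {
      used -= items.back().second;   // evict from the cold end of the LRU
      items.pop_back();
    }
  }

  std::mutex lock;
  std::list<std::pair<uint64_t, uint64_t>> items;  // (key, size), MRU first
  uint64_t used = 0;
  uint64_t target;
};

int main()
{
  LruCache cache(1 << 20);     // 1 MiB target
  for (uint64_t i = 0; i < 1000; ++i)
    cache.insert(i, 4096);     // usage stays near the target the whole time
  return 0;
}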


Patrick:  I've been (slowly, as time permits) moving all of the osd memory tuning code into its own manager class where you register caches, and perhaps have it spawn the thread that watches process memory usage.  I think we should try to coordinate efforts to make it as generic as possible so we can write all of this once for all of the daemons.
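
Something along these lines, as an interface sketch only (the class and names below are made up, not the actual code): caches register a trim callback, and a single watcher thread per daemon compares process memory usage against the target and asks them to give memory back.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class MemoryManager {
public:
  // Each cache registers a callback that trims it down to a byte budget.
  void register_cache(std::function<void(uint64_t target_bytes)> trim_cb)
  {
    std::lock_guard<std::mutex> l(lock);
    caches.push_back(std::move(trim_cb));
  }

  // One watcher thread per daemon: compare process memory usage against the
  // target and ask the registered caches to shrink when we're over it.
  void start(uint64_t memory_target, std::function<uint64_t()> get_process_rss)
  {
    running = true;
    watcher = std::thread([this, memory_target, get_process_rss] {
      while (running) {
        if (get_process_rss() > memory_target) {
          std::lock_guard<std::mutex> l(lock);
          if (!caches.empty()) {
            // Very naive policy: split the target evenly across the caches.
            uint64_t per_cache = memory_target / caches.size();
            for (auto &trim : caches)
              trim(per_cache);
          }
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
      }
    });
  }

  void stop()
  {
    running = false;
    if (watcher.joinable())
      watcher.join();
  }

private:
  std::mutex lock;
  std::vector<std::function<void(uint64_t)>> caches;
  std::thread watcher;
  std::atomic<bool> running{false};
};

int main()
{
  MemoryManager mgr;
  mgr.register_cache([](uint64_t /*bytes*/) { /* trim this cache down */ });
  mgr.start(4ull << 30, [] { return uint64_t(3) << 30; });  // fake 3 GiB RSS
  mgr.stop();
  return 0;
}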


Mark



