Re: Ceph, container and memory


 




On 3/8/19 9:15 AM, Sebastien Han wrote:
Answering in a batch again:

* We only run a single process in a pod, so there is no need to worry
about sharing the cgroup limit with another process
* If we want to be container-ready but orchestrator agnostic (can be
k8s, swarm, whatever) then we should stick with reading
"/sys/fs/cgroup/memory/memory.limit_in_bytes" and apply a ceiling to it
I don't think it's that complicated, we can simply:
   * if "/sys/fs/cgroup/memory/memory.limit_in_bytes" ==
9223372036854771712 (i.e. no limit set), then do nothing; otherwise
take that value, apply .8 to it, and use the result as memory_target

How does that sound?
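For illustration only, a minimal sketch of that check in plain C++ (the
function name, the use of std::optional, and the .8 default are mine,
not from any Ceph branch; 9223372036854771712 is the value cgroup v1
reports when no memory limit is configured):

#include <cstdint>
#include <fstream>
#include <optional>

// Returns a derived memory target, or nullopt if there is no usable
// cgroup limit (file missing/unreadable, or the "unlimited" sentinel).
std::optional<uint64_t> cgroup_derived_memory_target(double ratio = 0.8)
{
  std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes");
  uint64_t limit = 0;
  if (!(f >> limit)) {
    return std::nullopt;   // not running under a readable memory cgroup
  }
  const uint64_t unlimited = 9223372036854771712ULL;  // no limit configured
  if (limit >= unlimited) {
    return std::nullopt;   // leave *_memory_target at its configured default
  }
  return static_cast<uint64_t>(limit * ratio);  // apply the ceiling factor
}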


Let's try it, see how it works, and revisit if it doesn't. :)  In my latest branch that includes the PriorityCache manager, the target memory is set in the constructor or by calling set_target_memory(uint64_t target).  Right now that's just set in the constructor when the cache manager is created in the bluestore mempool thread, but if we start using the cache manager for everything we could have it take on more of that functionality itself (read the cgroup limit, register its own config option listener for memory_target changes, etc.).
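In case it helps picture the shape of that, here is a rough skeleton
(only set_target_memory(uint64_t) comes from the discussion above; the
class name, the atomic member, and the config-change hook are
illustrative, not code from the branch):

#include <atomic>
#include <cstdint>

// Hypothetical manager that owns its memory target.
class CacheMemoryManager {
 public:
  // Target chosen at construction time (e.g. in the bluestore mempool
  // thread today), possibly seeded from a cgroup-derived value.
  explicit CacheMemoryManager(uint64_t target) : target_memory(target) {}

  // Called whenever a new budget is learned: a memory_target config
  // change, a re-read of the cgroup limit, etc.
  void set_target_memory(uint64_t target) { target_memory.store(target); }

  uint64_t get_target_memory() const { return target_memory.load(); }

 private:
  std::atomic<uint64_t> target_memory;
};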


Mark



Thanks!
–––––––––
Sébastien Han
Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."

On Fri, Mar 8, 2019 at 2:09 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

On 3/8/19 6:56 AM, Mark Nelson wrote:
On 3/7/19 5:25 PM, Sage Weil wrote:
On Thu, 7 Mar 2019, Patrick Donnelly wrote:
The problem here is that this is set based on POD_MEMORY_LIMIT, the
point at which we get ENOMEM, not the target POD_MEMORY_REQUEST, which
is what we should actually try to achieve.  IIUC the latter isn't going
to be found in any standard place in the container environment (besides
the kubernetes-specific POD_MEMORY_REQUEST).

With that caveat, it seems like we should *also* look at the cgroup
limit * some factor (e.g., .8) and use it as a ceiling for
*_memory_target...
After thinking about this more and given that _REQUEST is not
available via the cgroup API, I'm wondering if _REQUEST is really
valuable or not. In practice (for Rook), what's valuable about
_REQUEST vs. _LIMIT?

We can still have options for setting the desired memory target, but
why ask each operator to answer the question: "well, I have X amount
of RAM for this daemon but it probably should just use (X-Y) RAM to be
safe"?  Let's answer that question uniformly by adjusting config
defaults based on the hard limit.
From the osd side of things, the reality is that whatever
osd_memory_target is, our actual usage oscillates around that value by
+/- 10%.  And in general that is probably what we want... the host will
generally have some slack and short bursts will be okay.  And setting a
hard limit too close to the target will mean we'll crash with ENOMEM
too frequently.

So, independent of what the mechanism is for setting the osd target,
I'd probably want to set osd_memory_target == POD_MEMORY_REQUEST,
and set POD_MEMORY_LIMIT to 1.5x that as a safety stop.

1. We could get to the same place by setting
osd_memory_target_as_cgroup_limit_ratio to .66, but that would have to
be done in concert with whoever is setting POD_MEMORY_{LIMIT,REQUEST}
so that the ratio matches.

2. Or, we could teach the OSD to pick up on POD_MEMORY_REQUEST and set
osd_memory_target to that, and ignore the real limit.

3. Or, we could pick up on POD_MEMORY_REQUEST and set osd_memory_target
to that, and then just check the hard limit and use that as a cap (to
make sure POD_MEMORY_REQUEST isn't somehow > the hard limit, or the
hard limit * some factor like .66 or .8).
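For concreteness, a rough sketch of option 3 (which subsumes option 2's
environment read).  The function name, parameters, and .8 default are
illustrative, not an existing Ceph API, and POD_MEMORY_REQUEST is
assumed to be exposed to the process as an environment variable:

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <optional>

// Pick an OSD memory target: prefer POD_MEMORY_REQUEST if present,
// but never exceed a fraction of the hard cgroup limit.
uint64_t pick_osd_memory_target(uint64_t configured_default,
                                std::optional<uint64_t> cgroup_hard_limit,
                                double safety_factor = 0.8)
{
  uint64_t target = configured_default;

  // Option 2: honor the requested (soft) budget if the orchestrator set it.
  if (const char *req = std::getenv("POD_MEMORY_REQUEST")) {
    target = std::strtoull(req, nullptr, 10);
  }

  // Option 3's extra step: cap by hard limit * factor so a bogus or
  // missing request can't push us into ENOMEM territory.
  if (cgroup_hard_limit) {
    target = std::min<uint64_t>(
        target, static_cast<uint64_t>(*cgroup_hard_limit * safety_factor));
  }
  return target;
}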

I like 2 just because it's simple.  There's no magic, and it doesn't
invite extra complexity like monitoring /proc for changes to the cgroup
limit at runtime.  But maybe we want that?  Does kubernetes even let you
change these limits on running pods?

s



FWIW, I don't think it really matters that much.  In some fashion
we'll communicate to the OSD that the memory target should be set to
some fraction of the container size limit.  There are lots of ways we
can do it.  In other scenarios where resources are shared (bare-metal,
whatever) we'll want to let the user decide.  I vote for whatever
solution is the easiest to support with the least conditional corner
cases.  I will note that in the new age-binning branch we assign
memory at different priority levels.  We might be able to use
POD_MEMORY_REQUEST in some fashion.


I'm more concerned about the memory oscillation and potential for
hitting OOM.  When you've got a bunch of OSDs running on bare metal
that oscillation isn't quite as scary, since each OSD can hit its peak
at different points in time (though backfill/recovery or other global
events might prove that a bad assumption).  So far, switching the
bluestore caches to trim-on-write instead of trimming periodically in
the mempool thread seems to help (especially with the new age-binning
code).  I suspect on fast storage trim-on-write is avoiding some of
that oscillation.  You can see my first attempt at it as part of my
working age-binning branch here:


https://github.com/markhpc/ceph/commit/a7dd6890ca89f1c60f3e282e8b4f44f213a90fac



Patrick:  I've been (slowly, as time permits) moving all of the osd
memory tuning code into its own manager class where you register
caches and perhaps have it spawn the thread that watches process
memory usage.  I think we should try to coordinate efforts to make it
as generic as possible so we can write all of this once for all of the
daemons.

I should have included a link to what's currently in the manager in my
test branch:


https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/common/PriorityCache.cc


Right now tune_memory is called by some external thread, but we could
have the manager take care of it itself (I'm fine either way).  The idea
here is that in the mds you'd create a manager like so:


https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/os/bluestore/BlueStore.cc#L3516-L3522


have your cache implement the PriorityCache interface:


https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/common/PriorityCache.h


If you don't care about priorities (and especially if you only have
one thing to balance) you can just have the cache request memory at a
single PRI level like PRI0.  In the OSD I've got code to implement
age-binning in the caches so that we have an idea of the relative age of
what's in each cache.
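As a picture of that single-priority case, a very rough sketch (the
interface and method names below are illustrative stand-ins, not copied
from the PriorityCache.h linked above):

#include <cstdint>

// Illustrative priority levels (the real header defines its own).
enum class CachePri { PRI0, PRI1, PRI2 };

// Stand-in for the PriorityCache interface: a cache reports how much
// memory it wants per priority, and is told how much it actually gets.
struct PriorityCacheLike {
  virtual ~PriorityCacheLike() = default;
  virtual uint64_t request_bytes(CachePri pri) const = 0;
  virtual void assign_bytes(CachePri pri, uint64_t bytes) = 0;
};

// A cache that only cares about one thing: request everything at PRI0.
class SingleLevelCache : public PriorityCacheLike {
 public:
  explicit SingleLevelCache(uint64_t wanted) : wanted_bytes(wanted) {}

  uint64_t request_bytes(CachePri pri) const override {
    return pri == CachePri::PRI0 ? wanted_bytes : 0;
  }

  void assign_bytes(CachePri pri, uint64_t bytes) override {
    if (pri == CachePri::PRI0) {
      assigned_bytes = bytes;   // grow/trim the real cache toward this size
    }
  }

 private:
  uint64_t wanted_bytes;
  uint64_t assigned_bytes = 0;
};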



Mark



