Answering in a batch again:

* We only run a single process in a pod, so there is no need to worry
  about sharing the cgroup limit with another process.
* If we want to be container-ready but orchestrator-agnostic (k8s,
  Swarm, whatever) then we should stick with reading
  "/sys/fs/cgroup/memory/memory.limit_in_bytes" and applying a ceiling
  to it.

I don't think it's that complicated, we can simply:

* if "/sys/fs/cgroup/memory/memory.limit_in_bytes" reads
  9223372036854771712 (the "no limit" value) then do nothing; else take
  that value and multiply it by 0.8. The result becomes the
  memory_target. (See the sketch just below.)

How does that sound?
Thanks!
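For concreteness, a minimal sketch of that check (not actual Ceph code;
it assumes cgroup v1, and the helper name and the 0.8 ratio are
illustrative):

// Read the cgroup v1 hard limit and derive a memory target from it.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <optional>

// cgroup v1 reports "no limit" as the page counter maximum, which is
// 9223372036854771712 on 4 KiB page systems (2^63 rounded down to a
// page boundary).
constexpr uint64_t CGROUP_NO_LIMIT = 9223372036854771712ULL;

// Returns a derived memory target, or nothing if the cgroup is
// unlimited or unreadable (in which case do nothing, per the proposal).
std::optional<uint64_t> cgroup_memory_target(double ratio = 0.8) {
  std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes");
  uint64_t limit = 0;
  if (!(f >> limit)) {
    return std::nullopt;  // not in a cgroup v1 memory controller
  }
  if (limit >= CGROUP_NO_LIMIT) {
    return std::nullopt;  // unlimited: keep the configured default
  }
  return static_cast<uint64_t>(limit * ratio);
}

int main() {
  if (auto target = cgroup_memory_target()) {
    std::cout << "memory_target = " << *target << "\n";
  } else {
    std::cout << "no cgroup limit; leaving memory_target alone\n";
  }
  return 0;
}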
–––––––––
Sébastien Han
Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."

On Fri, Mar 8, 2019 at 2:09 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> On 3/8/19 6:56 AM, Mark Nelson wrote:
> >
> > On 3/7/19 5:25 PM, Sage Weil wrote:
> >> On Thu, 7 Mar 2019, Patrick Donnelly wrote:
> >>>> The problem here is that this is set based on POD_MEMORY_LIMIT,
> >>>> the point at which we get ENOMEM, not the target
> >>>> POD_MEMORY_REQUEST, which is what we should actually try to
> >>>> achieve. IIUC the latter isn't going to be found in any standard
> >>>> place in the container environment (besides the
> >>>> kubernetes-specific POD_MEMORY_REQUEST).
> >>>>
> >>>> With that caveat, it seems like we should *also* look at the
> >>>> cgroup limit * some factor (e.g., .8) and use it as a ceiling
> >>>> for *_memory_target...
> >>> After thinking about this more, and given that _REQUEST is not
> >>> available via the cgroup API, I'm wondering if _REQUEST is really
> >>> valuable or not. In practice (for Rook), what's valuable about
> >>> _REQUEST vs. _LIMIT?
> >>>
> >>> We can still have options for setting the desired memory target,
> >>> but why ask each operator to answer the question: "well, I have X
> >>> amount of RAM for this daemon but it probably should just use
> >>> (X-Y) RAM to be safe"? Let's answer that question uniformly by
> >>> adjusting config defaults based on the hard limit.
> >> From the OSD side of things, the reality is that whatever
> >> osd_memory_target is, our actual usage oscillates around that
> >> value by +/- 10%. And in general that is probably what we want...
> >> the host will generally have some slack and short bursts will be
> >> okay. And setting a hard limit too close to the target will mean
> >> we'll crash with ENOMEM too frequently.
> >>
> >> So, independent of what the mechanism is for setting the osd
> >> target, I'd probably want to set osd_memory_target ==
> >> POD_MEMORY_REQUEST, and set POD_MEMORY_LIMIT to 1.5x that as a
> >> safety stop.
> >>
> >> 1. We could get to the same place by setting
> >> osd_memory_target_as_cgroup_limit_ratio to .66, but that would
> >> have to be done in concert with whoever is setting
> >> POD_MEMORY_{LIMIT,REQUEST} so that the ratio matches.
> >>
> >> 2. Or, we could teach the OSD to pick up on POD_MEMORY_REQUEST
> >> and set osd_memory_target to that, and ignore the real limit.
> >>
> >> 3. Or, we could pick up on POD_MEMORY_REQUEST and set
> >> osd_memory_target to that, and then just check the hard limit and
> >> use that as a cap (to make sure POD_MEMORY_REQUEST isn't somehow
> >> > the hard limit, or the hard limit * some factor like .66 or .8).
> >>
> >> I like 2 just because it's simple. There's no magic, and it
> >> doesn't invite extra complexity like monitoring /proc for changes
> >> to the cgroup limit at runtime. But maybe we want that? Does
> >> kubernetes even let you change these limits on running pods?
> >>
> >> s
> >
> > FWIW, I don't think it really matters that much. In some fashion
> > we'll communicate to the OSD that the memory target should be set
> > to some fraction of the container size limit. There are lots of
> > ways we can do it. In other scenarios where resources are shared
> > (bare metal, whatever) we'll want to let the user decide. I vote
> > for whatever solution is the easiest to support with the fewest
> > conditional corner cases. I will note that in the new age-binning
> > branch we assign memory at different priority levels. We might be
> > able to use POD_MEMORY_REQUEST in some fashion.
> >
> > I'm more concerned about the memory oscillation and the potential
> > for hitting OOM. When you've got a bunch of OSDs running on bare
> > metal, that oscillation isn't quite as scary since each OSD can
> > hit its peak at different points in time (though backfill/recovery
> > or other global events might prove that a bad assumption). So far,
> > switching the bluestore caches to trim-on-write instead of
> > periodically in the mempool thread seems to help (especially with
> > the new age-binning code). I suspect on fast storage trim-on-write
> > is avoiding some of that oscillation. You can see my first attempt
> > at it as part of my working age-binning branch here:
> >
> > https://github.com/markhpc/ceph/commit/a7dd6890ca89f1c60f3e282e8b4f44f213a90fac
> >
> > Patrick: I've been (slowly, as time permits) moving all of the OSD
> > memory tuning code into its own manager class where you register
> > caches and perhaps have it spawn the thread that watches process
> > memory usage. I think we should try to coordinate efforts to make
> > it as generic as possible so we can write all of this once for all
> > of the daemons.
>
> Should have included a link to what's currently in the manager in my
> test branch:
>
> https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/common/PriorityCache.cc
>
> Right now tune_memory is called by some external thread, but we
> could have the manager take care of it itself (I'm fine either way).
> The idea here is that in the mds you'd create a manager like so:
>
> https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/os/bluestore/BlueStore.cc#L3516-L3522
>
> and have your cache implement the PriorityCache interface:
>
> https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/common/PriorityCache.h
>
> If you don't care about priorities (and especially if you only have
> one thing to balance) you can just have the cache request memory at
> a single PRI level like PRI0. In the OSD I've got code to implement
> age-binning in the caches so that we have an idea of the relative
> age of what's in each cache.
>
> Mark
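As an aside, a rough sketch of what Sage's option 3 could look like,
assuming POD_MEMORY_REQUEST is exposed to the daemon as a plain-integer
environment variable (the helper names and the 0.8 cap factor are
illustrative, not actual Ceph code):

// Use POD_MEMORY_REQUEST as the target, but cap it at the cgroup hard
// limit * some factor so the target never sits above the ENOMEM line.
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <optional>

// Parse an environment variable as a byte count; assumes plain integers.
std::optional<uint64_t> env_bytes(const char* name) {
  const char* v = std::getenv(name);
  if (!v) {
    return std::nullopt;
  }
  return std::strtoull(v, nullptr, 10);
}

// Read the cgroup v1 hard limit, treating the "unlimited" sentinel from
// the earlier sketch as no limit at all.
std::optional<uint64_t> cgroup_hard_limit() {
  std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes");
  uint64_t limit = 0;
  if (!(f >> limit) || limit >= 9223372036854771712ULL) {
    return std::nullopt;
  }
  return limit;
}

uint64_t pick_memory_target(uint64_t config_default,
                            double cap_factor = 0.8) {
  // Prefer the request; fall back to the configured default.
  uint64_t target = env_bytes("POD_MEMORY_REQUEST").value_or(config_default);
  // Cap at hard limit * factor in case the request is missing or is
  // somehow above the hard limit.
  if (auto hard = cgroup_hard_limit()) {
    uint64_t ceiling = static_cast<uint64_t>(*hard * cap_factor);
    if (target > ceiling) {
      target = ceiling;
    }
  }
  return target;
}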
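And for the PriorityCache discussion, a toy sketch of the
register-and-balance shape Mark describes; the real interface is in the
PriorityCache.h link above, and every name below is invented for
illustration:

// Caches register with a manager and request bytes per priority level;
// one tune_memory() pass hands memory out, highest priority first,
// until the overall target is exhausted.
#include <algorithm>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

enum class Pri { PRI0, PRI1, PRI2, LAST };

struct CacheSketch {
  virtual ~CacheSketch() = default;
  // Bytes this cache wants at the given priority. A cache that doesn't
  // care about priorities can request everything at PRI0.
  virtual int64_t request_bytes(Pri pri) const = 0;
  // Additional bytes granted during a balancing pass.
  virtual void grant_bytes(int64_t bytes) = 0;
};

class ManagerSketch {
  std::map<std::string, std::shared_ptr<CacheSketch>> caches;
  int64_t target_bytes;  // e.g. derived from osd_memory_target
public:
  explicit ManagerSketch(int64_t target) : target_bytes(target) {}

  void insert(const std::string& name, std::shared_ptr<CacheSketch> c) {
    caches[name] = std::move(c);
  }

  // One balancing pass, as tune_memory might be driven by the watcher
  // thread mentioned above.
  void tune_memory() {
    int64_t remaining = target_bytes;
    for (int p = 0; p < static_cast<int>(Pri::LAST); ++p) {
      for (auto& kv : caches) {
        int64_t want =
            std::max<int64_t>(kv.second->request_bytes(static_cast<Pri>(p)), 0);
        int64_t grant = std::min(want, remaining);
        kv.second->grant_bytes(grant);
        remaining -= grant;
      }
    }
  }
};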