Re: Ceph, container and memory


 




On 3/8/19 6:56 AM, Mark Nelson wrote:

On 3/7/19 5:25 PM, Sage Weil wrote:
On Thu, 7 Mar 2019, Patrick Donnelly wrote:
The problem here is that this is set based on POD_MEMORY_LIMIT, the point at which we get ENOMEM, not the target POD_MEMORY_REQUEST, which is what
we should actually try to achieve.  IIUC the latter isn't going to be
found in any standard place in the container environment (besides the
kubernetes-specific POD_MEMORY_REQUEST).

With that caveat, it seems like we should *also* look at the cgroup limit
* some factor (e.g., .8) and use it as a ceiling for *_memory_target...
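For illustration only, a rough Python sketch of the "cgroup limit * some factor as a ceiling" idea. The two file paths are the usual kernel locations for cgroup v2 and v1; the function names and the default ratio are assumptions made for this example and are not Ceph code.

```python
from pathlib import Path
from typing import Optional

# Usual kernel locations: cgroup v2 (unified hierarchy) first, then cgroup v1.
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)


def parse_cgroup_limit(text: str) -> Optional[int]:
    """Return the memory limit in bytes, or None if the cgroup is unlimited."""
    value = text.strip()
    if value == "max":  # cgroup v2 spelling of "no limit"
        return None
    limit = int(value)
    # cgroup v1 reports a huge page-aligned number (close to 2**63) when unset.
    if limit >= 1 << 60:
        return None
    return limit


def read_cgroup_limit(paths=CGROUP_LIMIT_FILES) -> Optional[int]:
    """Use the first limit file that can be read and parsed."""
    for p in paths:
        try:
            return parse_cgroup_limit(Path(p).read_text())
        except (OSError, ValueError):
            continue
    return None  # not in a memory cgroup, or files not visible


def capped_memory_target(requested: int, cgroup_limit: Optional[int],
                         ratio: float = 0.8) -> int:
    """Hypothetical helper: cap a requested target at ratio * cgroup limit."""
    if cgroup_limit is None:
        return requested
    return min(requested, int(cgroup_limit * ratio))
```

With a 4 GiB hard limit and a ratio of .8, any configured target above roughly 3.2 GiB would be pulled down to that ceiling; smaller targets pass through unchanged.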
After thinking about this more and given that _REQUEST is not
available via the cgroup API, I'm wondering if _REQUEST is really
valuable or not. In practice (for Rook), what's valuable about
_REQUEST vs. _LIMIT?

We can still have options for setting the desired memory target but
why ask each operator to answer the question: "well, I have X amount
of RAM for this daemon but it probably should just use (X-Y) RAM to be
safe". Let's answer that question uniformly by adjusting config
defaults based on the hard-limit.
From the osd side of things, the reality is that whatever
osd_memory_target is, our actual usage oscillates around that value by +/-
10%.  And in general that is probably what we want... the host will
generally have some slack and short bursts will be okay.  And setting a
hard limit too close to the target will mean we'll crash with ENOMEM too
frequently.

So, independent of what the mechanism is for setting the osd target,
I'd probably want to set osd_memory_target == POD_MEMORY_REQUEST,
and set POD_MEMORY_LIMIT 1.5x that as a safety stop.

1. We could get to the same place by setting
osd_memory_target_as_cgroup_limit_ratio to .66, but that would have to be done in concert with whoever is setting POD_MEMORY_{LIMIT,REQUEST} so that
the ratio matches.

2. Or, we could teach the OSD to pick up on POD_MEMORY_REQUEST and set
osd_memory_target to that, and ignore the real limit.

3. Or, we could pick up on POD_MEMORY_REQUEST and set osd_memory_target to
that, and then just check the hard limit and use that as a cap (to make
sure POD_MEMORY_REQUEST isn't somehow > the hard limit, or the hard limit
* some factor like .66 or .8).
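To make the three options concrete, here is a small Python sketch of each one as a pure function, plus the +/- 10% oscillation check. The function names are made up for this example, and the request and limit values are passed in as plain byte counts; how they would actually be read from the pod environment is left out.

```python
from typing import Optional

GIB = 1 << 30


def option1_ratio_of_limit(limit: int, ratio: float = 0.66) -> int:
    """Option 1: target is a fixed fraction of the hard limit."""
    return int(limit * ratio)


def option2_request_only(request: int) -> int:
    """Option 2: target is the request; the hard limit is ignored."""
    return request


def option3_request_capped(request: int, limit: Optional[int],
                           factor: float = 0.8) -> int:
    """Option 3: target is the request, capped at factor * hard limit."""
    if limit is None:
        return request
    return min(request, int(limit * factor))


def peak_usage(target: int, swing: float = 0.10) -> int:
    """Upper end of the usage oscillation around the target."""
    return int(target * (1 + swing))


if __name__ == "__main__":
    request = 4 * GIB
    limit = 6 * GIB  # 1.5x the request, as suggested above
    for name, target in (
        ("option 1", option1_ratio_of_limit(limit)),
        ("option 2", option2_request_only(request)),
        ("option 3", option3_request_capped(request, limit)),
    ):
        headroom = limit - peak_usage(target)
        print(f"{name}: target={target} bytes, headroom at peak={headroom} bytes")
```

With a 4 GiB request and a 6 GiB limit, all three land at or near 4 GiB, and a 10% swing above the target still leaves well over 1 GiB before the hard limit. The options only diverge when the request and limit are not set in the assumed 1:1.5 proportion.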

I like 2 just because it's simple.  There's no magic, and it doesn't
invite extra complexity like monitoring /proc for changes to the cgroup
limit at runtime.  But maybe we want that?  Does kubernetes even let you
change these limits on running pods?

s




FWIW, I don't think it really matters that much.  In some fashion we'll communicate to the OSD that the memory target should be set to some fraction of the container size limit.  There are lots of ways we can do it.  In other scenarios where resources are shared (bare-metal, whatever) we'll want to let the user decide.  I vote for whatever solution is the easiest to support with the fewest conditional corner cases.  I will note that in the new age-binning branch we assign memory at different priority levels.  We might be able to use POD_MEMORY_REQUEST in some fashion.


I'm more concerned about the memory oscillation and the potential for hitting OOM.  When you've got a bunch of OSDs running on bare metal, that oscillation isn't quite as scary since each OSD can hit its peak at a different point in time (though backfill/recovery or other global events might prove that a bad assumption).  So far, switching the bluestore caches to trim-on-write instead of trimming periodically in the mempool thread seems to help (especially with the new age-binning code).  I suspect that on fast storage trim-on-write is avoiding some of that oscillation.  You can see my first attempt at it as part of my working age-binning branch here:


https://github.com/markhpc/ceph/commit/a7dd6890ca89f1c60f3e282e8b4f44f213a90fac


Patrick:  I've been (slowly, as time permits) moving all of the osd memory tuning code into its own manager class where you register caches and perhaps have it spawn the thread that watches process memory usage.  I think we should try to coordinate efforts to make it as generic as possible so we can write all of this once for all of the daemons.


Should have included a link to what's currently in the manager in my test branch:


https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/common/PriorityCache.cc


Right now tune_memory is called by some external thread, but we could have the manager take care of it itself (I'm fine either way).  The idea here is that in the mds you'd create a manager like so:


https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/os/bluestore/BlueStore.cc#L3516-L3522


have your cache implement the PriorityCache interface:


https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/common/PriorityCache.h


If you don't care about priorities (and especially if you only have one thing to balance) you can just have the cache request memory at a single PRI level like PRI0.  In the OSD I've got code to implement age-binning in the caches so that we have an idea of the relative age of what's in each cache.
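As a toy illustration of the register-caches-then-tune idea, here is a Python sketch. It is not the PriorityCache interface from the branch linked above: the class names, method names, and the proportional split used when memory runs out are all assumptions made for this example.

```python
from typing import Dict, List


class ToyCache:
    """Hypothetical cache that asks for bytes at each priority level."""

    def __init__(self, name: str, wants: Dict[int, int]):
        self.name = name
        self.wants = wants   # priority level -> bytes wanted at that level
        self.assigned = 0    # bytes granted by the last tune_memory() call

    def request_bytes(self, pri: int) -> int:
        return self.wants.get(pri, 0)


class ToyManager:
    """Hypothetical manager: hands out a memory target in priority order."""

    def __init__(self, target_bytes: int, levels: int = 4):
        self.target_bytes = target_bytes
        self.levels = levels
        self.caches: List[ToyCache] = []

    def register(self, cache: ToyCache) -> None:
        self.caches.append(cache)

    def tune_memory(self) -> None:
        for c in self.caches:
            c.assigned = 0
        remaining = self.target_bytes
        for pri in range(self.levels):  # level 0 is served first
            wants = [c.request_bytes(pri) for c in self.caches]
            total = sum(wants)
            if total == 0:
                continue
            if total <= remaining:
                # Everything asked for at this level fits.
                for c, w in zip(self.caches, wants):
                    c.assigned += w
                remaining -= total
            else:
                # Not enough left: split what remains in proportion and stop.
                for c, w in zip(self.caches, wants):
                    c.assigned += remaining * w // total
                break


if __name__ == "__main__":
    mgr = ToyManager(target_bytes=100)
    a = ToyCache("a", {0: 40, 1: 100})
    b = ToyCache("b", {0: 20, 1: 100})
    mgr.register(a)
    mgr.register(b)
    mgr.tune_memory()
    print(a.assigned, b.assigned)
```

A daemon with only one thing to balance would register a single cache that requests everything at level 0, in which case the manager simply hands it the whole target (or as much of its request as fits).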




Mark



