Answering in a batch again:

* We only run a single process in a pod, so there is no need to worry
  about sharing the cgroup limit with another process.
* If we want to be container-ready but orchestrator-agnostic (k8s,
  Swarm, whatever) then we should stick with reading
  "/sys/fs/cgroup/memory/memory.limit_in_bytes" and applying a ceiling
  to it.

I don't think it's that complicated, we can simply:

* if "/sys/fs/cgroup/memory/memory.limit_in_bytes" reads
  9223372036854771712 (the "no limit" value) then do nothing; else take
  that value and multiply it by 0.8. The result becomes the
  memory_target. (See the sketch just below.)

How does that sound?
Thanks!
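For concreteness, a minimal sketch of that check (not actual Ceph code;
it assumes cgroup v1, and the helper name and the 0.8 ratio are
illustrative):

// Read the cgroup v1 hard limit and derive a memory target from it.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <optional>

// cgroup v1 reports "no limit" as the page counter maximum, which is
// 9223372036854771712 on 4 KiB page systems (2^63 rounded down to a
// page boundary).
constexpr uint64_t CGROUP_NO_LIMIT = 9223372036854771712ULL;

// Returns a derived memory target, or nothing if the cgroup is
// unlimited or unreadable (in which case do nothing, per the proposal).
std::optional<uint64_t> cgroup_memory_target(double ratio = 0.8) {
  std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes");
  uint64_t limit = 0;
  if (!(f >> limit)) {
    return std::nullopt;  // not in a cgroup v1 memory controller
  }
  if (limit >= CGROUP_NO_LIMIT) {
    return std::nullopt;  // unlimited: keep the configured default
  }
  return static_cast<uint64_t>(limit * ratio);
}

int main() {
  if (auto target = cgroup_memory_target()) {
    std::cout << "memory_target = " << *target << "\n";
  } else {
    std::cout << "no cgroup limit; leaving memory_target alone\n";
  }
  return 0;
}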
–––––––––
Sébastien Han
Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."

On Fri, Mar 8, 2019 at 2:09 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> On 3/8/19 6:56 AM, Mark Nelson wrote:
> >
> > On 3/7/19 5:25 PM, Sage Weil wrote:
> >> On Thu, 7 Mar 2019, Patrick Donnelly wrote:
> >>>> The problem here is that this is set based on POD_MEMORY_LIMIT,
> >>>> the point at which we get ENOMEM, not the target
> >>>> POD_MEMORY_REQUEST, which is what we should actually try to
> >>>> achieve. IIUC the latter isn't going to be found in any standard
> >>>> place in the container environment (besides the
> >>>> kubernetes-specific POD_MEMORY_REQUEST).
> >>>>
> >>>> With that caveat, it seems like we should *also* look at the
> >>>> cgroup limit * some factor (e.g., .8) and use it as a ceiling
> >>>> for *_memory_target...
> >>> After thinking about this more, and given that _REQUEST is not
> >>> available via the cgroup API, I'm wondering if _REQUEST is really
> >>> valuable or not. In practice (for Rook), what's valuable about
> >>> _REQUEST vs. _LIMIT?
> >>>
> >>> We can still have options for setting the desired memory target,
> >>> but why ask each operator to answer the question: "well, I have X
> >>> amount of RAM for this daemon but it probably should just use
> >>> (X-Y) RAM to be safe"? Let's answer that question uniformly by
> >>> adjusting config defaults based on the hard limit.
> >> From the OSD side of things, the reality is that whatever
> >> osd_memory_target is, our actual usage oscillates around that
> >> value by +/- 10%. And in general that is probably what we want...
> >> the host will generally have some slack and short bursts will be
> >> okay. And setting a hard limit too close to the target will mean
> >> we'll crash with ENOMEM too frequently.
> >>
> >> So, independent of what the mechanism is for setting the osd
> >> target, I'd probably want to set osd_memory_target ==
> >> POD_MEMORY_REQUEST, and set POD_MEMORY_LIMIT to 1.5x that as a
> >> safety stop.
> >>
> >> 1. We could get to the same place by setting
> >> osd_memory_target_as_cgroup_limit_ratio to .66, but that would
> >> have to be done in concert with whoever is setting
> >> POD_MEMORY_{LIMIT,REQUEST} so that the ratio matches.
> >>
> >> 2. Or, we could teach the OSD to pick up on POD_MEMORY_REQUEST
> >> and set osd_memory_target to that, and ignore the real limit.
> >>
> >> 3. Or, we could pick up on POD_MEMORY_REQUEST and set
> >> osd_memory_target to that, and then just check the hard limit and
> >> use that as a cap (to make sure POD_MEMORY_REQUEST isn't somehow
> >> > the hard limit, or the hard limit * some factor like .66 or .8).
> >>
> >> I like 2 just because it's simple. There's no magic, and it
> >> doesn't invite extra complexity like monitoring /proc for changes
> >> to the cgroup limit at runtime. But maybe we want that? Does
> >> kubernetes even let you change these limits on running pods?
> >>
> >> s
> >
> > FWIW, I don't think it really matters that much. In some fashion
> > we'll communicate to the OSD that the memory target should be set
> > to some fraction of the container size limit. There are lots of
> > ways we can do it. In other scenarios where resources are shared
> > (bare metal, whatever) we'll want to let the user decide. I vote
> > for whatever solution is the easiest to support with the fewest
> > conditional corner cases. I will note that in the new age-binning
> > branch we assign memory at different priority levels. We might be
> > able to use POD_MEMORY_REQUEST in some fashion.
> >
> > I'm more concerned about the memory oscillation and the potential
> > for hitting OOM. When you've got a bunch of OSDs running on bare
> > metal, that oscillation isn't quite as scary since each OSD can
> > hit its peak at different points in time (though backfill/recovery
> > or other global events might prove that a bad assumption). So far,
> > switching the bluestore caches to trim-on-write instead of
> > periodically in the mempool thread seems to help (especially with
> > the new age-binning code). I suspect on fast storage trim-on-write
> > is avoiding some of that oscillation. You can see my first attempt
> > at it as part of my working age-binning branch here:
> >
> > https://github.com/markhpc/ceph/commit/a7dd6890ca89f1c60f3e282e8b4f44f213a90fac
> >
> > Patrick: I've been (slowly, as time permits) moving all of the OSD
> > memory tuning code into its own manager class where you register
> > caches and perhaps have it spawn the thread that watches process
> > memory usage. I think we should try to coordinate efforts to make
> > it as generic as possible so we can write all of this once for all
> > of the daemons.
>
> Should have included a link to what's currently in the manager in my
> test branch:
>
> https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/common/PriorityCache.cc
>
> Right now tune_memory is called by some external thread, but we
> could have the manager take care of it itself (I'm fine either way).
> The idea here is that in the mds you'd create a manager like so:
>
> https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/os/bluestore/BlueStore.cc#L3516-L3522
>
> and have your cache implement the PriorityCache interface:
>
> https://github.com/markhpc/ceph/blob/746ca05a92178b615cffb71d2f2ecb26a1b4d3cb/src/common/PriorityCache.h
>
> If you don't care about priorities (and especially if you only have
> one thing to balance) you can just have the cache request memory at
> a single PRI level like PRI0. In the OSD I've got code to implement
> age-binning in the caches so that we have an idea of the relative
> age of what's in each cache.
>
> Mark
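As an aside, a rough sketch of what Sage's option 3 could look like,
assuming POD_MEMORY_REQUEST is exposed to the daemon as a plain-integer
environment variable (the helper names and the 0.8 cap factor are
illustrative, not actual Ceph code):

// Use POD_MEMORY_REQUEST as the target, but cap it at the cgroup hard
// limit * some factor so the target never sits above the ENOMEM line.
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <optional>

// Parse an environment variable as a byte count; assumes plain integers.
std::optional<uint64_t> env_bytes(const char* name) {
  const char* v = std::getenv(name);
  if (!v) {
    return std::nullopt;
  }
  return std::strtoull(v, nullptr, 10);
}

// Read the cgroup v1 hard limit, treating the "unlimited" sentinel from
// the earlier sketch as no limit at all.
std::optional<uint64_t> cgroup_hard_limit() {
  std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes");
  uint64_t limit = 0;
  if (!(f >> limit) || limit >= 9223372036854771712ULL) {
    return std::nullopt;
  }
  return limit;
}

uint64_t pick_memory_target(uint64_t config_default,
                            double cap_factor = 0.8) {
  // Prefer the request; fall back to the configured default.
  uint64_t target = env_bytes("POD_MEMORY_REQUEST").value_or(config_default);
  // Cap at hard limit * factor in case the request is missing or is
  // somehow above the hard limit.
  if (auto hard = cgroup_hard_limit()) {
    uint64_t ceiling = static_cast<uint64_t>(*hard * cap_factor);
    if (target > ceiling) {
      target = ceiling;
    }
  }
  return target;
}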
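And for the PriorityCache discussion, a toy sketch of the
register-and-balance shape Mark describes; the real interface is in the
PriorityCache.h link above, and every name below is invented for
illustration:

// Caches register with a manager and request bytes per priority level;
// one tune_memory() pass hands memory out, highest priority first,
// until the overall target is exhausted.
#include <algorithm>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

enum class Pri { PRI0, PRI1, PRI2, LAST };

struct CacheSketch {
  virtual ~CacheSketch() = default;
  // Bytes this cache wants at the given priority. A cache that doesn't
  // care about priorities can request everything at PRI0.
  virtual int64_t request_bytes(Pri pri) const = 0;
  // Additional bytes granted during a balancing pass.
  virtual void grant_bytes(int64_t bytes) = 0;
};

class ManagerSketch {
  std::map<std::string, std::shared_ptr<CacheSketch>> caches;
  int64_t target_bytes;  // e.g. derived from osd_memory_target
public:
  explicit ManagerSketch(int64_t target) : target_bytes(target) {}

  void insert(const std::string& name, std::shared_ptr<CacheSketch> c) {
    caches[name] = std::move(c);
  }

  // One balancing pass, as tune_memory might be driven by the watcher
  // thread mentioned above.
  void tune_memory() {
    int64_t remaining = target_bytes;
    for (int p = 0; p < static_cast<int>(Pri::LAST); ++p) {
      for (auto& kv : caches) {
        int64_t want =
            std::max<int64_t>(kv.second->request_bytes(static_cast<Pri>(p)), 0);
        int64_t grant = std::min(want, remaining);
        kv.second->grant_bytes(grant);
        remaining -= grant;
      }
    }
  }
};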