On Thu, 7 Mar 2019, Sage Weil wrote:
> On Thu, 7 Mar 2019, Sebastien Han wrote:
> > Let me take back something, afterthoughts: yes, "memory.request" is
> > what we are interested in; the only thing is that there is no
> > /sys/fs/cgroup/memory equivalent.
> 
> Okay, if that's the case, then we have to take the limit from the
> kubernetes-specific environment variable. The downside is it's
> kube-specific, but so be it.
> 
> > If we take the OSD example, we could set REQUEST to 1GB and LIMIT to
> > 2GB (making up numbers again), which leaves 1GB for backfilling and
> > other stuff.
> 
> Just a side note here: there is no longer a recovery vs non-recovery
> memory utilization delta. IIRC we even backported that
> osd_pg_log_{min,max}_entries config change to luminous too.
> 
> > The last thing is: as part of these 4GB, how much do we want to give
> > to the osd_memory_target?
> 
> I think we should set osd_memory_target to POD_MEMORY_REQUEST if it's
> non-zero. If it is 0 and POD_MEMORY_LIMIT is set, I suggest .8 * that.

Hmm, I think the place to do that is here:

    https://github.com/ceph/ceph/blob/master/src/common/config.cc#L456

It's a bit awkward to parse the env variable into one of
{osd,mds,mon}_memory_target based on the daemon type, but it could be
done (once the others exist).

I wonder if, instead, we should have named the option just
memory_target. :/

sage

> 
> sage
> 
> > 
> > Thanks!
> > –––––––––
> > Sébastien Han
> > Principal Software Engineer, Storage Architect
> > 
> > "Always give 100%. Unless you're giving blood."
> > 
> > On Thu, Mar 7, 2019 at 9:51 AM Sebastien Han <shan@xxxxxxxxxx> wrote:
> > >
> > > Replying to everyone in a batch:
> > >
> > > * Yes, memory.limit is always >= memory.request. If memory.request
> > > is unset and memory.limit is set, then memory.request = memory.limit.
> > > k8s will refuse to schedule a POD if memory.limit < memory.request.
> > > The interesting part: if memory.limit is set (e.g. 1GB) and
> > > memory.request is unset, then during the scheduling process the
> > > scheduler will do memory.request = memory.limit.
> > > However, once the POD is running, the env variables will respectively
> > > be POD_MEMORY_LIMIT=1GB (in bytes) and POD_MEMORY_REQUEST=0... so
> > > that's confusing.
> > > Generally, we do care about both LIMIT and REQUEST, because REQUEST
> > > is only used for scheduling, so someone could do memory.request =
> > > 512MB and memory.limit = 1GB to make sure the POD can be scheduled.
> > > In the Ceph context this makes perfect sense since, for example, an
> > > OSD might need (making numbers up) 2GB for REQUEST (memory consumed
> > > 80% of the time) but 4GB for LIMIT ("burstable" when we backfill).
> > > So I believe we should read LIMIT only and apply our thresholds
> > > based on the value we read (unless it is 0).
> > >
> > > * The current Rook PR: https://github.com/rook/rook/pull/2764, which
> > > sets both osd_memory_target and mds_cache_memory_limit at startup on
> > > the daemon CLI.
> > > I consider this temporary until we get this done in Ceph natively
> > > (by reading env vars).
> > >
> > > * LIMIT is what should be considered the actual memory pool
> > > available to a daemon; auto-tuning should be done based on that
> > > value, and thresholds and alerts too.
> > >
> > > * The cgroup equivalent of POD_MEMORY_LIMIT is
> > > /sys/fs/cgroup/memory/memory.limit_in_bytes.
> > > If memory.limit_in_bytes is equal to 9223372036854771712, then there
> > > is no limit and we probably shouldn't tune anything or try to come
> > > up with a formula.
> > > There is also memory.usage_in_bytes that can be useful.
> > >
> > > * The route to uniformity with *_memory_limit is something that
> > > would help us in the short term to configure things from the
> > > orchestrator; in the long run, Ceph will tune that variable on its
> > > own.
> > >
> > > Thanks!
> > > –––––––––
> > > Sébastien Han
> > > Principal Software Engineer, Storage Architect
> > >
> > > "Always give 100%. Unless you're giving blood."
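Putting the pieces above together, a minimal sketch of what the lookup
could look like (illustrative only: get_pod_memory_target() and the
helper names are invented here, not Ceph code; the .8 factor and the
9223372036854771712 no-limit sentinel come from the messages above):

    // Illustrative sketch only -- helper names are invented, not Ceph code.
    #include <cstdint>
    #include <cstdlib>
    #include <fstream>

    // Value kubernetes leaves in memory.limit_in_bytes when there is no limit.
    static constexpr uint64_t CGROUP_NO_LIMIT = 9223372036854771712ULL;

    static uint64_t env_bytes(const char *name) {
      const char *v = std::getenv(name);
      return v ? std::strtoull(v, nullptr, 10) : 0;
    }

    static uint64_t cgroup_limit_bytes() {
      std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes");
      uint64_t v = 0;
      if (!(f >> v) || v >= CGROUP_NO_LIMIT)
        return 0;  // unreadable, or "no limit": don't tune anything
      return v;
    }

    // Returns the memory target to configure, or 0 if nothing was discovered.
    uint64_t get_pod_memory_target() {
      if (uint64_t request = env_bytes("POD_MEMORY_REQUEST"))
        return request;                      // aim for the scheduling request
      if (uint64_t limit = env_bytes("POD_MEMORY_LIMIT"))
        return limit * 8 / 10;               // Sage's suggested .8 * LIMIT
      return cgroup_limit_bytes() * 8 / 10;  // fall back to the cgroup value
    }

Whether the same .8 headroom should also apply to the raw cgroup value,
as the last line assumes, is an open question.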
> > > On Tue, Mar 5, 2019 at 7:06 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> > > >
> > > > On 3/5/19 11:51 AM, Patrick Donnelly wrote:
> > > > > On Tue, Mar 5, 2019 at 8:41 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > > >>> If memory.requests is omitted for a container, it defaults to limits.
> > > > >>> If memory.limits is not set, it defaults to 0 (unbounded).
> > > > >>> If neither of the two is specified, then we don't tune anything
> > > > >>> because we don't really know what to do.
> > > > >>>
> > > > >>> So far I've collected a couple of Ceph flags that are worth tuning:
> > > > >>>
> > > > >>> * mds_cache_memory_limit
> > > > >>> * osd_memory_target
> > > > >>>
> > > > >>> These flags will be passed at instantiation time for the MDS and
> > > > >>> the OSD daemon.
> > > > >>> Since most of the daemons have some cache flag, it'd be nice to
> > > > >>> unify them with a new option --{daemon}-memory-target.
> > > > >>> Currently I'm exposing POD properties via env vars too, which Ceph
> > > > >>> can consume later for more autotuning (POD_{MEMORY,CPU}_LIMIT,
> > > > >>> POD_{CPU,MEMORY}_REQUEST).
> > > > >> Ignoring mds_cache_memory_limit for now; I think we should wait
> > > > >> until we have mds_memory_target before doing any magic there.
> > > > >>
> > > > >> For the osd_memory_target, though, I think we could make the OSD
> > > > >> pick up on the POD_MEMORY_REQUEST variable and, if present, set
> > > > >> osd_memory_target to that value. Or, instead of putting the burden
> > > > >> on ceph, simply have rook pass --osd-memory-target on the command
> > > > >> line, or (post-startup) do
> > > > >> 'ceph daemon osd.N config set osd_memory_target ...'. (The
> > > > >> advantage of the latter is that it can more easily be overridden
> > > > >> at runtime.)
> > > > > Is POD_MEMORY_LIMIT|REQUEST standardized somewhere? Using an
> > > > > environment variable to communicate resource restrictions is useful
> > > > > but also hard to change on-the-fly. Can we (Ceph) read this
> > > > > information from the cgroup the Ceph daemon has been assigned to?
> > > > > Reducing the amount of configuration is one of our goals, so if we
> > > > > can make Ceph more aware of its environment as far as resource
> > > > > constraints go, we should go that route.
> > > > >
> > > > > The MDS should self-configure mds_cache_memory_limit based on
> > > > > memory.requests. That takes the magic formula out of the hands of
> > > > > users and forgetful devs :)
> > > > >
> > > > >> I'm not sure we have any specific action on the POD_MEMORY_LIMIT
> > > > >> value... the OSD should really be aiming for the REQUEST value
> > > > >> instead.
> > > > > I agree we should focus on memory.requests.
> > > >
> > > > I mentioned it in another doc, but I suspect it would be fairly easy
> > > > to adapt the osd_memory_target and autotuner code to work in the mds,
> > > > so long as we can adjust mds_cache_memory_limit on the fly. It would
> > > > be really nice to have all of the daemons conform to a standard
> > > > *_memory_limit interface. That's useful both inside a container
> > > > environment and on bare metal.
> > > >
> > > > Mark
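A sketch of the daemon-type dispatch described at the top of the thread
(again purely illustrative; per the discussion above, only
osd_memory_target exists today, so the mds/mon option names below are
hypothetical):

    // Illustrative sketch -- maps a daemon type to its memory-target option.
    #include <map>
    #include <string>

    std::string memory_target_option(const std::string &daemon_type) {
      static const std::map<std::string, std::string> options = {
        {"osd", "osd_memory_target"},
        {"mds", "mds_memory_target"},  // hypothetical: does not exist yet
        {"mon", "mon_memory_target"},  // hypothetical: does not exist yet
      };
      auto it = options.find(daemon_type);
      return it == options.end() ? std::string() : it->second;
    }

    // Hooked in near the config.cc line linked above, roughly (C++17):
    //   if (auto opt = memory_target_option(daemon_type); !opt.empty())
    //     if (uint64_t t = get_pod_memory_target())  // sketch from earlier
    //       set_val(opt, std::to_string(t));

Naming the option just memory_target, as suggested above, would make the
per-daemon map unnecessary.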
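For completeness, the two Rook-side alternatives mentioned in the quoted
exchange, with placeholder values:

    # at startup, on the daemon command line:
    --osd-memory-target=<bytes>

    # or post-startup, via the admin socket (easier to override at runtime):
    ceph daemon osd.N config set osd_memory_target <bytes>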