Re: Ceph, container and memory

On Thu, 7 Mar 2019, Sebastien Han wrote:
> Let me take that back; on second thought, yes, "memory.request" is
> what we are interested in. The only thing is that there is no
> /sys/fs/cgroup/memory equivalent.

Okay, if that's the case, then we have to take the limit from the 
kubernetes-specific environment variable.  The downside is it's 
kube-specific, but so be it.
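
For the sake of argument, the discovery side could look something like the
sketch below (untested Go-ish pseudocode, not actual Ceph or Rook code; the
env var names are the ones Rook exposes, and the cgroup path and "no limit"
sentinel are the ones Sébastien mentions further down):

    package main

    import (
        "fmt"
        "os"
        "strconv"
        "strings"
    )

    // cgroup v1 reports this sentinel value when no memory limit is set
    // on the container (value quoted from the thread below).
    const cgroupNoLimit = uint64(9223372036854771712)

    // podMemoryBounds returns (request, limit) in bytes; 0 means "unset".
    // Hypothetical helper, just to illustrate the idea.
    func podMemoryBounds() (uint64, uint64) {
        request := envBytes("POD_MEMORY_REQUEST")
        limit := envBytes("POD_MEMORY_LIMIT")

        // Fallback: the cgroup file only knows about the limit, not the
        // request, which is why the env vars are needed in the first place.
        if limit == 0 {
            data, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
            if err == nil {
                v, perr := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
                if perr == nil && v < cgroupNoLimit {
                    limit = v
                }
            }
        }
        return request, limit
    }

    // envBytes parses a byte count from an env var; 0 if unset or garbled.
    func envBytes(name string) uint64 {
        v, err := strconv.ParseUint(os.Getenv(name), 10, 64)
        if err != nil {
            return 0
        }
        return v
    }

    func main() {
        req, lim := podMemoryBounds()
        fmt.Println("request:", req, "limit:", lim)
    }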

> If we take the OSD example, we could set REQUEST to 1GB and LIMIT to
> 2GB (making up numbers again) which leaves 1GB for backfilling and
> other stuff.

Just a side note here: there is no longer a recovery vs non-recovery 
memory utilization delta.  IIRC we even backported the 
osd_pg_log_{min,max}_entries config change to luminous too.

> The last thing is, as part of these 4GB how much do we want to give to
> the osd_memory_target.

I think we should set osd_memory_target to POD_MEMORY_REQUEST if it's 
non-zero.  If it is 0 and POD_MEMORY_LIMIT is set, I suggest .8 * that.
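
Spelled out as a sketch (pickTarget is a hypothetical helper and the
numbers in main() are made up, just to show the rule):

    package main

    import "fmt"

    // pickTarget applies the rule above: use REQUEST if set, otherwise
    // 80% of LIMIT, otherwise leave the existing default alone.
    // Hypothetical helper, not actual Ceph code.
    func pickTarget(requestBytes, limitBytes, defaultTarget uint64) uint64 {
        switch {
        case requestBytes > 0:
            return requestBytes
        case limitBytes > 0:
            return uint64(float64(limitBytes) * 0.8)
        default:
            return defaultTarget // neither set: don't touch osd_memory_target
        }
    }

    func main() {
        // made-up example: request unset, limit 2 GiB, default 4 GiB
        // -> target of ~1.6 GiB
        fmt.Println(pickTarget(0, 2<<30, 4<<30))
    }

Either way the decision stays local to the daemon and doesn't need any kube 
API access, just the env vars.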

sage

> 
> Thanks!
> –––––––––
> Sébastien Han
> Principal Software Engineer, Storage Architect
> 
> "Always give 100%. Unless you're giving blood."
> 
> On Thu, Mar 7, 2019 at 9:51 AM Sebastien Han <shan@xxxxxxxxxx> wrote:
> >
> > Replying to everyone in a batch:
> >
> > * Yes, memory.limit is always >= memory.request. If memory.request is
> > unset and memory.limit is set, then memory.request = memory.limit.
> > k8s will refuse to schedule a POD if memory.limit < memory.request.
> > The interesting part: if memory.limit is set (e.g. 1GB) and
> > memory.request is unset, then during the scheduling process the
> > scheduler will do memory.request = memory.limit.
> > However, once the POD is running, the env variables will respectively
> > be POD_MEMORY_LIMIT=1GB (in bytes) and POD_MEMORY_REQUEST=0... So that's
> > confusing.
> > Generally, we do care about both LIMIT and REQUEST because REQUEST is
> > only used for scheduling, so someone could set memory.request = 512MB
> > and memory.limit = 1GB to make sure the POD can be scheduled.
> > In the Ceph context, this makes perfect sense since, for example, an
> > OSD might need (making numbers up) 2GB for REQUEST (the memory consumed
> > 80% of the time) but 4GB for LIMIT ("burstable", e.g. when we backfill).
> > So I believe we should read LIMIT only and apply our thresholds based
> > on the value we read, as long as it is not 0.
> >
> > * The current Rook PR (https://github.com/rook/rook/pull/2764) sets
> > both osd_memory_target and mds_cache_memory_limit at startup on
> > the daemon CLI.
> > I consider this temporary until we get this done in Ceph natively
> > (by reading env vars).
> >
> > * LIMIT is what should be considered the actual memory pool
> > available to a daemon; auto-tuning should be done based on that
> > value, and so should thresholds and alerts.
> >
> > * The cgroup equivalent of POD_MEMORY_LIMIT is
> > "/sys/fs/cgroup/memory/memory.limit_in_bytes"
> > If memory.limit_in_bytes is equal to 9223372036854771712 then there is
> > no limit and we shouldn't probably tune anything or try to come up
> > with a formula.
> > There is also memory.usage_in_bytes that can be useful.
> >
> > * The route to uniformity with *_memory_limit is something that would
> > help us in the short term to configure the orchestrator, whereas in the
> > long run Ceph would tune that variable on its own.
> >
> > Thanks!
> > –––––––––
> > Sébastien Han
> > Principal Software Engineer, Storage Architect
> >
> > "Always give 100%. Unless you're giving blood."
> >
> > On Tue, Mar 5, 2019 at 7:06 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> > >
> > >
> > > On 3/5/19 11:51 AM, Patrick Donnelly wrote:
> > > > On Tue, Mar 5, 2019 at 8:41 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > >>> If memory.requests is omitted for a container, it defaults to limits.
> > > >>> If memory.limits is not set, it defaults to 0 (unbounded).
> > > >>> If neither of the two is specified, we don't tune anything because
> > > >>> we don't really know what to do.
> > > >>>
> > > >>> So far I've collected a couple of Ceph flags that are worth tuning:
> > > >>>
> > > >>> * mds_cache_memory_limit
> > > >>> * osd_memory_target
> > > >>>
> > > >>> These flags will be passed at instantiation time for the MDS and the OSD daemon.
> > > >>> Since most of the daemons have some cache flag, it'll be nice to unify
> > > >>> them with a new option --{daemon}-memory-target.
> > > >>> Currently I'm exposing POD properties via env vars too, which Ceph
> > > >>> can consume later for more autotuning (POD_{MEMORY,CPU}_LIMIT,
> > > >>> POD_{CPU,MEMORY}_REQUEST).
> > > >> Ignoring mds_cache_memory_limit for now; I think we should wait until we
> > > >> have mds_memory_target before doing any magic there.
> > > >>
> > > >> For the osd_memory_target, though, I think we could make the OSD pick up
> > > >> on the POD_MEMORY_REQUEST variable and, if present, set osd_memory_target
> > > >> to that value.  Or, instead of putting the burden on ceph, simply have
> > > >> rook pass --osd-memory-target on the command line, or (post-startup) do
> > > >> 'ceph daemon osd.N config set osd_memory_target ...'.  (The advantage of
> > > >> the latter is that it can more easily be overridden at runtime.)
> > > > Is POD_MEMORY_LIMIT|REQUEST standardized somewhere? Using an
> > > > environment variable to communicate resource restrictions is useful
> > > > but also hard to change on-the-fly. Can we (Ceph) read this
> > > > information from the cgroup the Ceph daemon has been assigned to?
> > > > Reducing the amount of configuration is one of our goals, so if we
> > > > can make Ceph more aware of its environment as far as resource
> > > > constraints go, we should go that route.
> > > >
> > > > The MDS should self-configure mds_cache_memory_limit based on
> > > > memory.requests. That takes the magic formula out of the hands of
> > > > users and forgetful devs :)
> > > >
> > > >> I'm not sure we have any specific action on the POD_MEMORY_LIMIT value...
> > > >> the OSD should really be aiming for the REQUEST value instead.
> > > > I agree we should focus on memory.requests.
> > > >
> > >
> > > I mentioned it in another doc, but I suspect it would be fairly easy to
> > > adapt the osd_memory_target autotuner code to work in the mds so long
> > > as we can adjust mds_cache_memory_limit on the fly.  It would be really
> > > nice to have all of the daemons conform to a standard *_memory_limit
> > > interface.  That's useful both inside a container environment and on
> > > bare metal.
> > >
> > >
> > > Mark
> > >
> 
> 
> 
