Re: Ceph, container and memory

On 3/7/19 9:26 AM, Sage Weil wrote:
On Thu, 7 Mar 2019, Sage Weil wrote:
On Thu, 7 Mar 2019, Sebastien Han wrote:
Let me take something back, on second thought: yes, "memory.request" is
what we are interested in; the only thing is that there is no
/sys/fs/cgroup/memory equivalent for it.
Okay, if that's the case, then we have to take the limit from the
Kubernetes-specific environment variable.  The downside is that it's
kube-specific, but so be it.

If we take the OSD example, we could set REQUEST to 1GB and LIMIT to
2GB (making up numbers again) which leaves 1GB for backfilling and
other stuff.
Just a side note here: there is no longer a recovery vs non-recovery
memory utilization delta.  IIRC we even backported that
osd_pg_log_{min,max}_entries config change to luminous too.

The last thing is, as part of these 4GB how much do we want to give to
the osd_memory_target.
I think we should set osd_memory_target to POD_MEMORY_REQUEST if it's
non-zero.  If it is 0 and POD_MEMORY_LIMIT is set, I suggest 0.8 * that.
Hmm, I think the place to do that is here:

   https://github.com/ceph/ceph/blob/master/src/common/config.cc#L456

It's a bit awkward to parse the env variable into one of
{osd,mds,mon}_memory_target based on the daemon type, but it could be done
(once the others exist).
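
Something along these lines, perhaps (just a sketch, ignoring error
handling; getenv_bytes() and pick_container_memory_target() are made-up
helper names, and the 0.8 factor is the one suggested above):

    #include <cstdint>
    #include <cstdlib>

    // Sketch: pick a memory target from the k8s-provided env vars.
    // POD_MEMORY_REQUEST wins if non-zero; otherwise fall back to
    // 0.8 * POD_MEMORY_LIMIT.  Returns 0 if neither is usable.
    static uint64_t getenv_bytes(const char *name) {
      const char *v = std::getenv(name);
      return v ? std::strtoull(v, nullptr, 10) : 0;
    }

    uint64_t pick_container_memory_target() {
      uint64_t request = getenv_bytes("POD_MEMORY_REQUEST");
      if (request > 0)
        return request;
      uint64_t limit = getenv_bytes("POD_MEMORY_LIMIT");
      if (limit > 0)
        return limit * 4 / 5;   // 0.8 * LIMIT
      return 0;                 // no container hint; leave defaults alone
    }

    // The caller would then route the value to the daemon-specific
    // option by daemon type: osd -> osd_memory_target,
    // mds -> mds_memory_target, mon -> mon_memory_target (once those
    // options exist).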

I wonder if, instead, we should have named the option just memory_target.
:/

sage


That's a very container-centric way of looking at things. ;)


Mark




sage

Thanks!
–––––––––
Sébastien Han
Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."

On Thu, Mar 7, 2019 at 9:51 AM Sebastien Han <shan@xxxxxxxxxx> wrote:
Replying to everyone in a batch:

* Yes, memory.limit is always >= memory.request. If memory.request is
unset and memory.limit is set, then memory.request = memory.limit.
k8s will refuse to schedule a POD if memory.limit < memory.request.
The interesting part: if memory.limit is set (e.g. 1GB) and
memory.request is unset, then during scheduling the scheduler will set
memory.request = memory.limit.
However, once the POD is running, the env variables will respectively
be POD_MEMORY_LIMIT=1GB (in bytes) and POD_MEMORY_REQUEST=0... So
that's confusing.
Generally, we do care about both LIMIT and REQUEST, because REQUEST is
only used for scheduling, so someone could set memory.request = 512MB
and memory.limit = 1GB to make sure the POD can be scheduled.
In the Ceph context this makes perfect sense since, for example, an
OSD might need (making numbers up) 2GB for REQUEST (the memory
consumed 80% of the time) but 4GB for LIMIT ("burstable" when we
backfill).
So I believe we should read LIMIT only, and apply our thresholds based
on the value we read, when REQUEST = 0.

* The current Rook PR, https://github.com/rook/rook/pull/2764, sets
both osd_memory_target and mds_cache_memory_limit at startup on the
daemon CLI.
I consider this temporary until we get this done in Ceph natively (by
reading env vars).

* LIMIT is what should be considered the actual memory pool available
to a daemon; auto-tuning should be done based on that value, and
thresholds and alerts too.

* The cgroup equivalent of POD_MEMORY_LIMIT is
"/sys/fs/cgroup/memory/memory.limit_in_bytes".
If memory.limit_in_bytes is equal to 9223372036854771712, then there
is no limit and we probably shouldn't tune anything or try to come up
with a formula (see the sketch after this list).
There is also memory.usage_in_bytes, which can be useful.

* The route to uniformity with *_memory_limit is something that would
help us configure the orchestrator in the short term, while in the
long run Ceph will tune that variable on its own.
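
For reference, detecting the cgroup limit mentioned above is just a
file read; something like this (a minimal sketch, cgroup v1 path only,
and cgroup_memory_limit() is a made-up helper name):

    #include <cstdint>
    #include <fstream>
    #include <optional>

    // Read the memory limit of the cgroup we run in.  Returns nothing
    // if the file can't be read or the limit is the "unlimited"
    // sentinel, in which case we shouldn't try to auto-tune anything.
    std::optional<uint64_t> cgroup_memory_limit() {
      std::ifstream f("/sys/fs/cgroup/memory/memory.limit_in_bytes");
      uint64_t limit = 0;
      if (!(f >> limit))
        return std::nullopt;            // not in a cgroup / can't read
      if (limit >= 9223372036854771712ULL)
        return std::nullopt;            // no effective limit set
      return limit;
    }

    // memory.usage_in_bytes next to it gives current usage and could
    // feed thresholds and alerts the same way.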

Thanks!
–––––––––
Sébastien Han
Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."

On Tue, Mar 5, 2019 at 7:06 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

On 3/5/19 11:51 AM, Patrick Donnelly wrote:
On Tue, Mar 5, 2019 at 8:41 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
If memory.requests is omitted for a container, it defaults to limits.
If memory.limits is not set, it defaults to 0 (unbounded).
If neither of these two is specified, then we don't tune anything
because we don't really know what to do.

So far I've collected a couple of Ceph flags that are worth tuning:

* mds_cache_memory_limit
* osd_memory_target

These flags will be passed at instantiation time to the MDS and OSD daemons.
Since most of the daemons have some cache flag, it would be nice to
unify them with a new option --{daemon}-memory-target.
Currently I'm also exposing POD properties via env vars that Ceph can
consume later for more autotuning (POD_{MEMORY,CPU}_LIMIT,
POD_{CPU,MEMORY}_REQUEST).
Ignoring mds_cache_memory_limit for now; I think we should wait until we
have mds_memory_target before doing any magic there.

For the osd_memory_target, though, I think we could make the OSD pick up
on the POD_MEMORY_REQUEST variable and, if present, set osd_memory_target
to that value.  Or, instead of putting the burden on ceph, simply have
rook pass --osd-memory-target on the command line, or (post-startup) do
'ceph daemon osd.N config set osd_memory_target ...'.  (The advantage of
the latter is that it can more easily be overridden at runtime.)
Is POD_MEMORY_LIMIT|REQUEST standardized somewhere? Using an
environment variable to communicate resource restrictions is useful
but also hard to change on-the-fly. Can we (Ceph) read this
information from the cgroup the Ceph daemon has been assigned to?
Reducing the amount of configuration is one of our goals, so if we can
make Ceph more aware of its environment as far as resource constraints
go, we should take that route.

The MDS should self-configure mds_cache_memory_limit based on
memory.requests. That takes the magic formula out of the hands of
users and forgetful devs :)

I'm not sure we have any specific action on the POD_MEMORY_LIMIT
value... the OSD should really be aiming for the REQUEST value instead.
I agree we should focus on memory.requests.

I mentioned it in another doc, but I suspect it would be fairly easy to
adapt the osd_memory_target autotuner code to work in the MDS so long
as we can adjust mds_cache_memory_limit on the fly.  It would be really
nice to have all of the daemons conform to a standard *_memory_limit
interface.  That's useful both inside a container environment and on
bare metal.
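
For anyone who hasn't looked at it, the shape of that autotuning is
roughly the following (a very simplified sketch, not the actual Ceph
code; the real implementation uses the priority cache machinery and
allocator stats rather than /proc, and adjust_cache_limit() is a
made-up name):

    #include <cstdint>
    #include <fstream>
    #include <unistd.h>

    // Watch our own RSS and grow/shrink the cache budget so overall
    // usage converges on the memory target.
    static uint64_t current_rss_bytes() {
      std::ifstream f("/proc/self/statm");
      uint64_t size = 0, resident = 0;
      f >> size >> resident;
      return resident * sysconf(_SC_PAGESIZE);
    }

    uint64_t adjust_cache_limit(uint64_t memory_target,
                                uint64_t cache_limit,
                                uint64_t min_cache) {
      uint64_t rss = current_rss_bytes();
      if (rss > memory_target) {
        // over target: shave the cache budget
        uint64_t over = rss - memory_target;
        cache_limit = cache_limit > over ? cache_limit - over : min_cache;
      } else {
        // under target: hand some of the headroom back to the cache
        cache_limit += (memory_target - rss) / 2;
      }
      return cache_limit < min_cache ? min_cache : cache_limit;
    }

    // A daemon would call this periodically and apply the result to its
    // cache option (e.g. mds_cache_memory_limit adjusted on the fly).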


Mark




