On Mon, Mar 4, 2019 at 1:23 PM Sebastien Han <shan@xxxxxxxxxx> wrote:
>
> Hi,
>
> I'm writing this because I'm currently implementing memory tuning in
> Rook. The implementation is mostly based on the following pod
> properties:
>
> * memory.limit: defines a hard cap for the memory; if the container
>   tries to allocate more memory than the specified limit, it gets
>   terminated.
> * memory.request: used for scheduling only (and for the OOM strategy
>   when applying QoS).
>
> If memory.request is omitted for a container, it defaults to the
> limit. If memory.limit is not set, it defaults to 0 (unbounded). If
> neither of the two is specified, we don't tune anything, because we
> don't really know what to do.
>
> So far I've collected a couple of Ceph flags that are worth tuning:
>
> * mds_cache_memory_limit
> * osd_memory_target
>
> These flags will be passed at instantiation time for the MDS and OSD
> daemons. Since most of the daemons have some cache flag, it would be
> nice to unify them under a new option, --{daemon}-memory-target.
> Currently I'm also exposing pod properties via environment variables
> that Ceph can consume later for more autotuning
> (POD_{MEMORY,CPU}_LIMIT, POD_{CPU,MEMORY}_REQUEST).

Hmm, these names differ for a reason. The osd_memory_target is an
actual OSD target (although it's quite limited; the only real knob is
the bluestore cache sizes), whereas the mds_cache_memory_limit tries
to control the cache size but does not look at the total MDS memory
usage. There's a formula for roughly how much memory you can expect to
actually be used, but I forget what it is.

> One other cool thing would be to report (when containerized) that the
> daemon is close to its cgroup memory limit, so we could surface
> something in "ceph -s", or Ceph could re-adjust some of its internal
> values.
> As part of that PR I'm also implementing failures based on
> memory.limit per daemon, so I need to know the minimum amount of
> memory we want to recommend in production. It's not an easy thing to
> do, but we have to start somewhere.

I... think the defaults we already have are as close to a "universal"
recommendation as we can get. This needs to be easy to configure since
it will change based on expected use case.
-Greg

> Thanks!
> ---------
> Sébastien Han
> Principal Software Engineer, Storage Architect
>
> "Always give 100%. Unless you're giving blood."
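
The defaulting rules Sebastien describes (an omitted request falls back
to the limit, an omitted limit means 0/unbounded, and no tuning when
neither is set) could be sketched as follows in Go, Rook's
implementation language. This is a minimal sketch, not Rook's actual
code; the function name resolveMemory and its return convention are
hypothetical.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// resolveMemory applies the defaulting rules from the mail: an omitted
// request falls back to the limit, an omitted limit means 0 (unbounded),
// and if neither is set we skip tuning entirely.
func resolveMemory(res corev1.ResourceRequirements) (request, limit int64, tune bool) {
	limitQ, hasLimit := res.Limits[corev1.ResourceMemory]
	requestQ, hasRequest := res.Requests[corev1.ResourceMemory]

	if !hasLimit && !hasRequest {
		return 0, 0, false // nothing specified: don't tune anything
	}
	if hasLimit {
		limit = limitQ.Value()
	} // else limit stays 0, i.e. unbounded
	if hasRequest {
		request = requestQ.Value()
	} else {
		request = limit // request defaults to the limit
	}
	return request, limit, true
}

func main() {
	res := corev1.ResourceRequirements{
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("4Gi"),
		},
	}
	req, lim, tune := resolveMemory(res)
	fmt.Println(req, lim, tune) // 4294967296 4294967296 true
}
```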
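Wiring the resolved limit into the daemon, both as an instantiation
flag and as the POD_* environment variables via the Kubernetes
Downward API, might look like the sketch below. The 80% headroom ratio
and the helper names memoryTargetFlag/podResourceEnv are assumptions
for illustration, not Rook's real values.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// memoryTargetFlag derives a daemon flag such as --osd-memory-target
// from the container memory limit, leaving headroom for allocations the
// target does not cover (the 80% ratio here is illustrative only).
func memoryTargetFlag(daemon string, limitBytes int64) string {
	target := limitBytes * 8 / 10
	return fmt.Sprintf("--%s-memory-target=%d", daemon, target)
}

// podResourceEnv exposes the pod resource properties to the daemon via
// the Kubernetes Downward API (resourceFieldRef), so Ceph can consume
// them later for more autotuning.
func podResourceEnv(container string) []corev1.EnvVar {
	mk := func(name, res string) corev1.EnvVar {
		return corev1.EnvVar{
			Name: name,
			ValueFrom: &corev1.EnvVarSource{
				ResourceFieldRef: &corev1.ResourceFieldSelector{
					ContainerName: container,
					Resource:      res,
				},
			},
		}
	}
	return []corev1.EnvVar{
		mk("POD_MEMORY_LIMIT", "limits.memory"),
		mk("POD_MEMORY_REQUEST", "requests.memory"),
		mk("POD_CPU_LIMIT", "limits.cpu"),
		mk("POD_CPU_REQUEST", "requests.cpu"),
	}
}

func main() {
	fmt.Println(memoryTargetFlag("osd", 4<<30)) // --osd-memory-target=3435973836
	fmt.Println(len(podResourceEnv("osd")))     // 4
}
```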
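Finally, the per-daemon failure check based on memory.limit could take
a shape like this; the threshold values are placeholders, since the
thread leaves the real production minimums as an open question.

```go
package main

import "fmt"

// minRecommendedBytes holds illustrative per-daemon floors; these are
// hypothetical numbers, not an official Ceph recommendation.
var minRecommendedBytes = map[string]int64{
	"osd": 2 << 30, // placeholder: 2 GiB
	"mds": 1 << 30, // placeholder: 1 GiB
	"mon": 1 << 30, // placeholder: 1 GiB
}

// checkMemoryLimit fails early when a pod's memory.limit is below the
// recommended floor for that daemon; a limit of 0 means unbounded and
// is accepted as-is.
func checkMemoryLimit(daemon string, limitBytes int64) error {
	floor, ok := minRecommendedBytes[daemon]
	if !ok || limitBytes == 0 {
		return nil
	}
	if limitBytes < floor {
		return fmt.Errorf("%s memory limit %d is below the recommended minimum %d",
			daemon, limitBytes, floor)
	}
	return nil
}

func main() {
	if err := checkMemoryLimit("osd", 1<<30); err != nil {
		fmt.Println(err)
	}
}
```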