Re: Ceph, container and memory

On Mon, 4 Mar 2019, Sebastien Han wrote:
> Hi,
> 
> I'm writing this because I'm currently implementing memory tuning in
> Rook. The implementation is mostly based on the following POD
> properties:
> 
> * memory.limit: defines a hard cap for the memory; if the container
> tries to allocate more memory than the specified limit, it gets
> terminated.
> * memory.request: used for scheduling only (and for OOM strategy when
> applying QoS)

I assume that memory.limit is always >= memory.request, right?

> If memory.requests is omitted for a container, it defaults to limits.
> If memory.limits is not set, it defaults to 0 (unbounded).
> If neither of the two is specified then we don't tune anything because
> we don't really know what to do.
> 
> So far I've collected a couple of Ceph flags that are worth tuning:
> 
> * mds_cache_memory_limit
> * osd_memory_target
> 
> These flags will be passed at instantiation time for the MDS and the OSD daemon.
> Since most of the daemons have some cache flag, it'll be nice to unify
> them with a new option --{daemon}-memory-target.
> Currently I'm also exposing the POD properties as env vars that Ceph
> can consume later for more autotuning (POD_{MEMORY,CPU}_LIMIT,
> POD_{CPU,MEMORY}_REQUEST).
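
(For reference, a minimal Go sketch of the defaulting rules described
above; resolveMemoryTarget is an illustrative helper, not actual rook
code:)

    // resolveMemoryTarget applies the defaulting rules: an omitted request
    // falls back to the limit, an unset limit means unbounded, and nothing
    // is tuned when neither value is specified.
    func resolveMemoryTarget(requestBytes, limitBytes uint64) (target uint64, tune bool) {
            if requestBytes == 0 && limitBytes == 0 {
                    return 0, false // neither set: don't tune anything
            }
            if requestBytes == 0 {
                    return limitBytes, true // request defaults to limit
            }
            return requestBytes, true
    }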

Ignoring mds_cache_memory_limit for now; I think we should wait until we 
have mds_memory_target before doing any magic there.

For the osd_memory_target, though, I think we could make the OSD pick up 
on the POD_MEMORY_REQUEST variable and, if present, set osd_memory_target 
to that value.  Or, instead of putting the burden on ceph, simply have 
rook pass --osd-memory-target on the command line, or (post-startup) do 
'ceph daemon osd.N config set osd_memory_target ...'.  (The advantage of 
the latter is that it can more easily be overridden at runtime.)
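
For illustration, a rough Go sketch of the rook-side variant, deriving the
flag from a POD_MEMORY_REQUEST value in bytes (osdMemoryTargetArg is a
made-up helper name, not rook code):

    import (
            "fmt"
            "os"
            "strconv"
    )

    // osdMemoryTargetArg returns the extra OSD command-line argument derived
    // from the POD_MEMORY_REQUEST environment variable (bytes), or "" when no
    // request was set, leaving osd_memory_target at its built-in default.
    func osdMemoryTargetArg() string {
            raw := os.Getenv("POD_MEMORY_REQUEST")
            bytes, err := strconv.ParseUint(raw, 10, 64)
            if err != nil || bytes == 0 {
                    return ""
            }
            return fmt.Sprintf("--osd-memory-target=%d", bytes)
    }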

I'm not sure we have any specific action to take on the POD_MEMORY_LIMIT
value... the OSD should really be aiming for the REQUEST value instead.

> One other cool thing would be to report (when containerized) that the
> daemon is getting close to its cgroup memory limit, e.g. surface
> something in "ceph -s", or Ceph could re-adjust some of its internal
> values.
> As part of that PR I'm also implementing failures based on
> memory.limit per daemon. So I need to know what minimum amount of
> memory we want to recommend in production. It's not an easy thing
> to do but we have to start somewhere.

This might be interesting, though.  I see a few possibilities:

1/ Any time the actual RSS is > 20% over the target (or some other
tunable multiplier), we raise a health alert.  This is independent of
the LIMIT value (or of containers in general).

2/ We have rook set the above warning ratio based on how far apart LIMIT
and REQUEST are.  Although if LIMIT is something like 4x REQUEST, that
seems silly.  Maybe we set the ratio to, say, min(1.5, LIMIT/REQUEST).
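
Roughly (illustrative Go, not existing Ceph or rook code):

    import "math"

    // warnRatio derives the health-warning multiplier from how far apart the
    // container LIMIT and REQUEST are, capped at 1.5; without both values it
    // falls back to the 20% default from option 1.
    func warnRatio(limitBytes, requestBytes float64) float64 {
            if limitBytes == 0 || requestBytes == 0 {
                    return 1.2
            }
            return math.Min(1.5, limitBytes/requestBytes)
    }

    // shouldWarn reports whether the actual RSS is far enough over the
    // memory target to raise a health alert.
    func shouldWarn(rssBytes, targetBytes, ratio float64) bool {
            return targetBytes > 0 && rssBytes > targetBytes*ratio
    }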

sage


