Re: Ceph, container and memory

Ok, I guess I'll do my best, but in the future it'd be nice to have
Ceph be more aware of the environment it's running in and tune its
internal flags appropriately.
For instance, Ceph can already detect that it runs inside a container.
The next step is to do more introspection of that container
environment:

* check the CPU allocated (request and limit)
* check the memory allocated via cgroup (request and limit)

Then the daemon can auto-tune its knobs based on what it reads.
And again, raise an alert when the container is running low on memory.
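
Something like this minimal sketch in Go, assuming the standard
cgroup v1/v2 file locations (the 80% headroom ratio is just a
placeholder of mine, not an agreed-upon value):

    package main

    import (
        "fmt"
        "os"
        "strconv"
        "strings"
    )

    // readCgroupMemoryLimit tries the cgroup v2 file first, then
    // falls back to cgroup v1. Returns 0 if no limit is set.
    func readCgroupMemoryLimit() uint64 {
        for _, path := range []string{
            "/sys/fs/cgroup/memory.max",                   // cgroup v2
            "/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1
        } {
            data, err := os.ReadFile(path)
            if err != nil {
                continue
            }
            s := strings.TrimSpace(string(data))
            if s == "max" { // cgroup v2 reports "max" when unbounded
                return 0
            }
            if n, err := strconv.ParseUint(s, 10, 64); err == nil {
                return n
            }
        }
        return 0
    }

    func main() {
        limit := readCgroupMemoryLimit()
        if limit == 0 {
            fmt.Println("no cgroup memory limit, keeping defaults")
            return
        }
        // Leave some headroom below the hard cap so the kernel OOM
        // killer doesn't fire before the daemon's own accounting
        // catches up (the ratio is illustrative).
        target := limit * 8 / 10
        fmt.Printf("cgroup limit=%d bytes, derived target=%d bytes\n",
            limit, target)
    }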

Thanks!
–––––––––
Sébastien Han
Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."

On Mon, Mar 4, 2019 at 10:45 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
>
> On 3/4/19 3:22 PM, Sebastien Han wrote:
> > Hi,
> >
> > I'm writing this because I'm currently implementing memory tuning in
> > Rook. The implementation is mostly based on the following POD
> > properties:
> >
> > * memory.limit: defines a hard cap for the memory; if the container
> > tries to allocate more memory than the specified limit, it gets
> > terminated.
> > * memory.request: used for scheduling only (and for the OOM strategy
> > when applying QoS)
> >
> > If memory.requests is omitted for a container, it defaults to limits.
> > If memory.limits is not set, it defaults to 0 (unbounded).
> > If neither of the two is specified, then we don't tune anything
> > because we don't really know what to do.
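> >
> > In Go (Rook's language), that decision roughly looks like the
> > sketch below; the function name and the choice of tuning against
> > the limit are mine, for illustration:
> >
> >     // memoryTargetFromPodResources applies the rules above and
> >     // returns the value to tune against, plus whether to tune
> >     // at all.
> >     func memoryTargetFromPodResources(request, limit uint64) (uint64, bool) {
> >         if request == 0 && limit == 0 {
> >             return 0, false // nothing specified: don't touch anything
> >         }
> >         if request == 0 {
> >             request = limit // requests omitted: defaults to limits
> >         }
> >         if limit == 0 {
> >             return request, true // unbounded: fall back to the request
> >         }
> >         return limit, true // tune against the hard cap
> >     }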
> >
> > So far I've collected a couple of Ceph flags that are worth tuning:
> >
> > * mds_cache_memory_limit
> > * osd_memory_target
> >
> > These flags will be passed at instantiation time to the MDS and OSD daemons.
> > Since most of the daemons have some cache flag, it'd be nice to unify
> > them with a new option --{daemon}-memory-target.
> > Currently I'm also exposing POD properties via env vars that Ceph can
> > consume later for more autotuning (POD_{MEMORY,CPU}_LIMIT,
> > POD_{CPU,MEMORY}_REQUEST).
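> >
> > With client-go, wiring one of those env vars through the Kubernetes
> > downward API looks roughly like this (the helper name is made up):
> >
> >     import (
> >         corev1 "k8s.io/api/core/v1"
> >         "k8s.io/apimachinery/pkg/api/resource"
> >     )
> >
> >     // podMemoryLimitEnvVar exposes the container's memory limit,
> >     // in plain bytes, as POD_MEMORY_LIMIT.
> >     func podMemoryLimitEnvVar() corev1.EnvVar {
> >         return corev1.EnvVar{
> >             Name: "POD_MEMORY_LIMIT",
> >             ValueFrom: &corev1.EnvVarSource{
> >                 ResourceFieldRef: &corev1.ResourceFieldSelector{
> >                     Resource: "limits.memory",
> >                     Divisor:  resource.MustParse("1"), // plain bytes
> >                 },
> >             },
> >         }
> >     }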
> >
> > One other cool thing would be to report (when containerized) that the
> > daemon is close to its cgroup memory limit, e.g. surface something in
> > "ceph -s", or Ceph could re-adjust some of its internal values.
> > As part of that PR I'm also implementing failures based on
> > memory.limit per daemon. So I need to know the minimum amount of
> > memory we want to recommend in production. It's not an easy thing
> > to do, but we have to start somewhere.
>
>
> I wouldn't recommend setting the osd_memory_target below 2GB. The OSD
> will likely function, but the autotuner may not be able to keep the
> mapped memory below the target without additional tuning (pglog length,
> rocksdb WAL buffer sizes, etc.).  If you also lower these you can make
> the OSD fit within a relatively small memory footprint, but it may be
> quite a bit slower at various things.  A 4GB target is generally enough
> to leave about 2-3GB available for the bluestore caches (assuming some
> fragmentation and general memory allocator inefficiency), i.e. onodes,
> cached data, and the rocksdb block cache.  A 2GB target will likely
> result in somewhere between cache_min (128MB) and 1GB of memory for
> caches. Below a 2GB target, the autotuner will mostly just assign
> cache_min and the osd_memory_target may be exceeded.
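>
> On the Rook side that advice boils down to a clamp, something like
> this sketch (the 2GB floor is the recommendation above; the names
> are illustrative):
>
>     import "fmt"
>
>     const minOSDMemoryTarget = 2 << 30 // 2GiB floor, per above
>
>     // osdMemoryTargetFromLimit refuses limits below the floor
>     // instead of silently handing the OSD a target it can't honor.
>     func osdMemoryTargetFromLimit(podMemoryLimit uint64) (uint64, error) {
>         if podMemoryLimit < minOSDMemoryTarget {
>             return 0, fmt.Errorf("pod memory limit %d is below the "+
>                 "recommended 2GiB minimum for osd_memory_target",
>                 podMemoryLimit)
>         }
>         return podMemoryLimit, nil
>     }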
>
>
> Mark
>
> >
> > Thanks!
> > –––––––––
> > Sébastien Han
> > Principal Software Engineer, Storage Architect
> >
> > "Always give 100%. Unless you're giving blood."


