Re: Ceph, container and memory

On 3/5/19 6:30 AM, Sebastien Han wrote:
Ok, I guess I'll do my best, but in the future it would be nice to have
Ceph be more aware of the environment it's running on and tune its
internal flags appropriately.
For instance, Ceph can already detect that it runs inside a container.
The next step is to do more introspection of this container
environment:

* check the CPU allocated (request and limit)


There's likely some interaction here with Sage's recent NUMA pinning work, i.e. we probably want to migrate away from having the user define the number of threads/shards and more toward telling the OSD how many cores it has available and whether or not it should try to do any NUMA pinning (not exactly sure how this will work in the container world).


* check the memory allocated via cgroup (request and limit)

Then the daemon can auto-tune its knobs based on what was read, and
raise an alert when the container is running low on memory (a rough
sketch of this kind of cgroup introspection follows below).
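For what it's worth, both limits are already visible to the daemon through the cgroup filesystem. A rough, hypothetical sketch in Python of reading them, assuming the usual cgroup v2 and v1 paths are what the container sees (not anything Ceph does today):

# Hypothetical sketch: read the container's memory and CPU limits from the
# cgroup filesystem so a daemon could derive its own tuning targets.
def _read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

def cgroup_memory_limit():
    """Memory limit in bytes, or None if unbounded/unknown."""
    v2 = _read("/sys/fs/cgroup/memory.max")                     # cgroup v2
    if v2 is not None:
        return None if v2 == "max" else int(v2)
    v1 = _read("/sys/fs/cgroup/memory/memory.limit_in_bytes")   # cgroup v1
    if v1 is not None:
        limit = int(v1)
        return None if limit >= 1 << 62 else limit              # huge sentinel == no limit
    return None

def cgroup_cpu_limit():
    """CPU limit as a number of cores, or None if unbounded/unknown."""
    v2 = _read("/sys/fs/cgroup/cpu.max")                        # "<quota> <period>" or "max <period>"
    if v2 is not None:
        quota, period = v2.split()
        return None if quota == "max" else int(quota) / int(period)
    quota = _read("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")        # cgroup v1
    period = _read("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
    if quota is not None and period is not None:
        return None if int(quota) < 0 else int(quota) / int(period)
    return None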


Right now we check the target against the amount of mapped memory as determined by tcmalloc. There are some really strange interactions you can run into when you try to balance against broader targets like RSS memory usage, i.e. sometimes de-allocating memory doesn't actually result in memory being freed by the kernel (transparent huge pages and general memory pressure can affect this). Then you have to decide: do I keep deallocating memory, or do I wait and see if the memory eventually gets freed? This can lead to bad behavior either way. On one hand you end up operating over your threshold while waiting to see what happens, and on the other hand you can end up in thrashing cycles where you over-deallocate memory with no immediate change, but eventually huge amounts of memory get freed, resulting in re-allocation that starts the whole cycle all over again.

It might be that we don't run into issues like this just looking at the amount of container memory used, but we are going to have to be very careful about how we approach this or we could run into some nasty corner cases. For now I think we probably want to stick with looking at tcmalloc mapped memory as our target until we know that a different statistic isn't prone to this kind of behavior.
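To make the thrashing concern concrete, here is a toy Python sketch (not Ceph's actual autotuner) of one way to avoid over-reacting: only shrink caches when mapped memory is over target and a grace period has passed since the last shrink, giving memory the allocator has already released a chance to actually be returned to the kernel.

import time

# Toy illustration of the trade-off above; the grace period and shrink step
# are made-up values, not Ceph defaults.
class CacheTuner:
    def __init__(self, target_bytes, grace_seconds=30.0, step=0.9):
        self.target = target_bytes
        self.grace = grace_seconds        # how long to wait after a shrink
        self.step = step                  # shrink caches to 90% of current size
        self.last_shrink = 0.0

    def adjust(self, mapped_bytes, cache_bytes, now=None):
        """Return the new cache size to aim for."""
        now = time.monotonic() if now is None else now
        if mapped_bytes <= self.target:
            return cache_bytes            # under target: leave caches alone
        if now - self.last_shrink < self.grace:
            # Over target, but we shrank recently; the deallocated memory may
            # simply not have been returned to the kernel yet, so don't over-react.
            return cache_bytes
        self.last_shrink = now
        return int(cache_bytes * self.step)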


Mark



Thanks!
–––––––––
Sébastien Han
Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."

On Mon, Mar 4, 2019 at 10:45 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

On 3/4/19 3:22 PM, Sebastien Han wrote:
Hi,

I'm writing this because I'm currently implementing memory tuning in
Rook. The implementation is mostly based on the following POD
properties:

* memory.limit: defines a hard cap on memory; if the container tries
to allocate more memory than the specified limit, it gets terminated.
* memory.request: used for scheduling only (and for the OOM strategy
when applying QoS)

If memory.requests is omitted for a container, it defaults to limits.
If memory.limits is not set, it defaults to 0 (unbounded).
If neither of the two is specified, we don't tune anything because we
don't really know what to do (the decision logic is sketched below).
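For illustration, a minimal Python sketch of that decision (hypothetical, not the actual Rook code; falling back to the request when only the request is set is my assumption):

def memory_budget(request_bytes=0, limit_bytes=0):
    """Pick the number of bytes to tune against, or None to leave defaults."""
    if limit_bytes:
        return limit_bytes        # hard cap: the safest value to tune against
    if request_bytes:
        return request_bytes      # no cap, but the scheduler guaranteed this much
    return None                   # neither set: don't tune anything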

So far I've collected a couple of Ceph flags that are worth tuning:

* mds_cache_memory_limit
* osd_memory_target

These flags will be passed at instantiation time to the MDS and OSD daemons.
Since most of the daemons have some cache flag, it would be nice to
unify them with a new option --{daemon}-memory-target.
Currently I'm also exposing the pod properties via env vars that Ceph
can consume later for more autotuning (POD_{MEMORY,CPU}_LIMIT,
POD_{CPU,MEMORY}_REQUEST).
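A hypothetical entrypoint-side sketch in Python of consuming those env vars to derive an --osd-memory-target; the variable names come from the list above, but the 0.8 headroom factor is only an assumption, not a Ceph or Rook default:

import os

def osd_memory_target_from_env(safety_factor=0.8):
    limit = int(os.environ.get("POD_MEMORY_LIMIT", "0") or 0)
    request = int(os.environ.get("POD_MEMORY_REQUEST", "0") or 0)
    budget = limit or request
    if not budget:
        return None                           # nothing set: keep Ceph's default
    return int(budget * safety_factor)        # leave headroom below the hard cap

target = osd_memory_target_from_env()
if target:
    print("--osd-memory-target=%d" % target)  # e.g. append to the OSD command line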

One other cool thing would be to report (when containerized) that the
daemon is getting close to its cgroup memory limit, e.g. surface
something in "ceph -s", or Ceph could re-adjust some of its internal
values.
As part of that PR I'm also implementing failures based on
memory.limit per daemon, so I need to know the minimum amount of
memory we want to recommend in production. It's not an easy thing to
do, but we have to start somewhere.

I wouldn't recommend setting the osd_memory_target below 2GB. The OSD
will likely function, but the autotuner may not be able to keep the
mapped memory below the target without additional tuning (pglog length,
rocksdb WAL buffer sizes, etc). If you also lower these you can make
the OSD fit within a relatively small memory footprint, but it may be
quite a bit slower at various things. A 4GB target is generally enough
to result in about 2-3GB being available for bluestore caches (onodes,
cached data, and the rocksdb block cache), assuming some fragmentation
and general memory allocator inefficiency. A 2GB target will likely
result in somewhere between cache_min (128MB) and 1GB of memory for
caches. Below a 2GB target, it's likely the autotuner will mostly just
assign cache_min and the osd_memory_target may be exceeded.
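Turning those numbers into the kind of per-daemon check mentioned earlier might look roughly like this (Python sketch; the thresholds come from the guidance above, the exact behavior and messages are only illustrative):

GiB = 1 << 30

def check_osd_memory_target(target_bytes):
    """Sanity-check an osd_memory_target against the sizing guidance above."""
    if target_bytes < 2 * GiB:
        raise ValueError("osd_memory_target below 2GB: the autotuner will "
                         "likely pin caches at cache_min and the target may "
                         "still be exceeded")
    if target_bytes < 4 * GiB:
        print("warning: a 2-4GB target leaves roughly cache_min (128MB) to "
              "1GB for bluestore caches; a 4GB target gives about 2-3GB")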


Mark

Thanks!
–––––––––
Sébastien Han
Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."


