Re: Ceph, container and memory

On 3/4/19 3:22 PM, Sebastien Han wrote:
Hi,

I'm writing this because I'm currently implementing memory tuning in
Rook. The implementation is mostly based on the following POD
properties:

* memory.limit: defines a hard cap on memory; if the container tries
to allocate more memory than the specified limit, it gets terminated.
* memory.request: used for scheduling only (and for the OOM-kill
strategy when applying QoS)

If memory.requests is omitted for a container, it defaults to limits.
If memory.limits is not set, it defaults to 0 (unbounded).
If neither of the two is specified, then we don't tune anything because
we don't really know what to do (see the sketch below).
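
For concreteness, here is a minimal sketch of those two properties
expressed with the Kubernetes Go types, roughly as an operator like Rook
would build a container spec. The container name and the 1Gi/512Mi
values are placeholders for illustration, not recommendations:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// cephDaemonContainer shows memory.limit and memory.request on a
// container spec. Values here are placeholders, not recommendations.
func cephDaemonContainer() corev1.Container {
	return corev1.Container{
		Name: "osd", // hypothetical container name
		Resources: corev1.ResourceRequirements{
			// Hard cap: allocating past this gets the container OOM-killed.
			Limits: corev1.ResourceList{
				corev1.ResourceMemory: resource.MustParse("1Gi"),
			},
			// Used by the scheduler (and as input to QoS/OOM scoring).
			Requests: corev1.ResourceList{
				corev1.ResourceMemory: resource.MustParse("512Mi"),
			},
		},
	}
}

func main() {
	c := cephDaemonContainer()
	limit := c.Resources.Limits[corev1.ResourceMemory]
	fmt.Println(limit.String()) // "1Gi"
}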

So far I've collected a couple of Ceph flags that are worth tuning:

* mds_cache_memory_limit
* osd_memory_target

These flags will be passed at instantiation time to the MDS and OSD daemons.
Since most of the daemons have some cache flag, it would be nice to unify
them with a new option --{daemon}-memory-target.
Currently I'm also exposing POD properties via environment variables
(POD_{MEMORY,CPU}_LIMIT, POD_{CPU,MEMORY}_REQUEST) that Ceph can
consume later for more autotuning; see the sketch below.
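
The natural mechanism for those POD_* variables is the Kubernetes
downward API, which injects a container's own resource numbers as
environment variables. A sketch with the core/v1 types follows; the
variable names come from the email above, while the exact wiring into
the pod spec is assumed:

package main

import corev1 "k8s.io/api/core/v1"

// podResourceEnvVars exposes the pod's resource limits/requests to the
// Ceph process via the downward API, matching the POD_* names above.
func podResourceEnvVars() []corev1.EnvVar {
	mk := func(name, res string) corev1.EnvVar {
		return corev1.EnvVar{
			Name: name,
			ValueFrom: &corev1.EnvVarSource{
				ResourceFieldRef: &corev1.ResourceFieldSelector{
					// e.g. "limits.memory"; note CPU values are rounded
					// up to whole cores unless a Divisor is set.
					Resource: res,
				},
			},
		}
	}
	return []corev1.EnvVar{
		mk("POD_MEMORY_LIMIT", "limits.memory"),
		mk("POD_MEMORY_REQUEST", "requests.memory"),
		mk("POD_CPU_LIMIT", "limits.cpu"),
		mk("POD_CPU_REQUEST", "requests.cpu"),
	}
}

// Sketch only: the returned vars would be appended to the container's Env.
func main() {}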

One other cool thing would be to report (when containerized) that the
daemon is close to its cgroup memory limit, so we could surface something
in "ceph -s", or Ceph could re-adjust some of its internal values.
As part of that PR I'm also implementing failures based on
memory.limit per daemon. So I need to know the minimum amount
of memory we want to recommend in production. It's not an easy thing
to do, but we have to start somewhere.
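
As a rough illustration of the "close to the cgroup limit" check (not
how Ceph implements it): read the cgroup memory accounting files and
compare usage to the limit. The cgroup v1 paths and the 80% threshold
here are assumptions for the sketch:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readCgroupBytes parses one of the cgroup v1 memory accounting files.
func readCgroupBytes(name string) (uint64, error) {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory/" + name)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	limit, err := readCgroupBytes("memory.limit_in_bytes")
	if err != nil {
		fmt.Println("no cgroup v1 memory limit visible:", err)
		return
	}
	usage, err := readCgroupBytes("memory.usage_in_bytes")
	if err != nil {
		fmt.Println("cannot read usage:", err)
		return
	}
	// 80% is an arbitrary threshold for this sketch; this is the point
	// where a daemon could raise a health warning for "ceph -s".
	if float64(usage) > 0.8*float64(limit) {
		fmt.Printf("memory usage %d is close to cgroup limit %d\n", usage, limit)
	}
}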


I wouldn't recommend setting the osd_memory_target below 2GB. The OSD will likely function, but the autotuner may not be able to keep the mapped memory below the target without additional tuning (pglog length, rocksdb WAL buffer sizes, etc.). If you also lower these, you can make the OSD fit within a relatively small memory footprint, but it may be quite a bit slower at various things. A 4GB target is generally enough to leave about 2-3GB available for bluestore caches (assuming some fragmentation and general memory allocator inefficiency), i.e. onodes, cached data, and the rocksdb block cache. A 2GB target will likely result in somewhere between cache_min (128MB) and 1GB of memory for caches. Below a 2GB target, the autotuner will mostly just assign cache_min, and the osd_memory_target may be exceeded.
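
Translating that guidance into the per-daemon sizing logic Sebastien
describes might look like the hypothetical helper below. The 2GB floor
comes from the paragraph above; the 0.8 headroom ratio (room for mapped
memory to exceed the target under fragmentation) is purely an assumption:

package main

import (
	"errors"
	"fmt"
)

const minOSDMemoryTarget = 2 << 30 // 2GiB floor suggested above

// osdMemoryTargetFromLimit derives an osd_memory_target from the pod's
// memory.limit, refusing to go below the recommended floor. The 0.8
// headroom factor is an assumption for this sketch, not a Ceph default.
func osdMemoryTargetFromLimit(podLimitBytes uint64) (uint64, error) {
	target := uint64(float64(podLimitBytes) * 0.8)
	if target < minOSDMemoryTarget {
		return 0, errors.New("pod memory limit too small for a production OSD")
	}
	return target, nil
}

func main() {
	// A 4GiB pod limit yields a ~3.2GiB target, above the 2GB floor.
	if target, err := osdMemoryTargetFromLimit(4 << 30); err == nil {
		fmt.Printf("--osd-memory-target=%d\n", target)
	}
}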


Mark


Thanks!
–––––––––
Sébastien Han
Principal Software Engineer, Storage Architect

"Always give 100%. Unless you're giving blood."


