[RFC Proposal] Deterministic memcg charging for shared memory

Below is a proposal for deterministic charging of shared memory.
Please take a look and let me know if there are any major concerns:

Problem:
Currently shared memory is charged to the memcg of the process that
first touches, and thereby allocates, each page. This makes the memory
usage of processes accessing shared memory unpredictable, since
whichever process accesses the memory first gets charged. We have a
number of use cases where our userspace would like deterministic
charging of shared memory:

1. System services allocating memory for client jobs:
We have services (namely a network access service[1]) that provide
functionality for clients running on the machine and allocate memory
to carry out these services. The memory usage of these services
depends on the number of jobs running on the machine and the nature of
the requests made to the service, which makes the memory usage of
these services hard to predict and thus hard to limit via memory.max.
These system services would like a way to allocate memory and instruct
the kernel to charge this memory to the client’s memcg.

2. Shared filesystem between subtasks of a large job:
Our infrastructure has large meta jobs, such as Kubernetes, which spawn
multiple subtasks that share a tmpfs mount. These jobs and their
subtasks use that tmpfs mount for various purposes, such as sharing
data or persisting data across subtask restarts. In Kubernetes
terminology, the meta job is similar to a pod and the subtasks are
the containers under that pod. We want the shared memory to be
deterministically charged to the Kubernetes pod, independent of
the lifetime of the containers under the pod.

3. Shared libraries and language runtimes shared between independent jobs:
We’d like to optimize memory usage on the machine by sharing libraries
and language runtimes among many of the processes running on our
machines in separate memcgs. This has the side effect that one job may
be unlucky enough to be the first to access many of the libraries and
may get oom-killed as all the cached files get charged to it.
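
To make the first-touch behavior in the use cases above concrete, here
is a toy model in Python. It is not kernel code; the class name, the
job names, and the page count are all made up for illustration. It only
shows that the charge lands on whichever memcg happens to run first.

```python
from collections import defaultdict


class FirstTouchCache:
    """Toy model: each page is charged to the memcg that touches it first."""

    def __init__(self):
        self.owner = {}                # page id -> memcg charged for it
        self.usage = defaultdict(int)  # memcg -> number of pages charged

    def access(self, memcg, page):
        # Only the first access to a page charges anyone. Later accesses
        # from other memcgs reuse the page without being charged.
        if page not in self.owner:
            self.owner[page] = memcg
            self.usage[memcg] += 1


def run(order, pages=range(100)):
    """Let each memcg in `order` read every page, in that order."""
    cache = FirstTouchCache()
    for memcg in order:
        for page in pages:
            cache.access(memcg, page)
    return dict(cache.usage)


print(run(["job_a", "job_b"]))  # {'job_a': 100}
print(run(["job_b", "job_a"]))  # {'job_b': 100}
```

Both jobs do identical work in both runs; only the start order differs,
and that alone decides which one carries the whole charge.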

Design:
My rough proposal to solve this problem is to simply add a
‘memcg=/path/to/memcg’ mount option for filesystems (namely tmpfs),
directing all the memory of the filesystem to be ‘remote charged’ to
the cgroup specified by that memcg= option.
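
As a sketch of how userspace might use this, the Python helper below
builds a mount(8) command line. Note that memcg= is only the option
proposed in this mail and does not exist in mainline; the helper name,
the mount point, and the cgroup path are hypothetical. size= is the
existing tmpfs option, included only to show the options combining.

```python
def tmpfs_mount_argv(mountpoint, memcg_path, size=None):
    """Build a mount(8) argv for a tmpfs with the proposed memcg= option."""
    opts = [f"memcg={memcg_path}"]   # proposed in this RFC, not in mainline
    if size is not None:
        opts.append(f"size={size}")  # existing tmpfs size limit option
    return ["mount", "-t", "tmpfs", "-o", ",".join(opts), "tmpfs", mountpoint]


# Hypothetical paths, for illustration only.
argv = tmpfs_mount_argv("/mnt/shared", "/path/to/memcg", size="1G")
print(" ".join(argv))
# mount -t tmpfs -o memcg=/path/to/memcg,size=1G tmpfs /mnt/shared
```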

Caveats:
1. One complication to address is the behavior when the target memcg
hits its memory.max limit because of remote charging. In this case the
oom-killer will be invoked, but the oom-killer may not find anything
to kill in the target memcg being charged. When that happens, I propose
simply failing the remote charge, which will cause the process
executing the remote charge to get an ENOMEM. This will be documented
behavior of remote charging.
2. I would like to provide an initial implementation that adds this
support for tmpfs, while leaving the implementation generic enough for
myself or others to extend to more filesystems where they find the
feature useful.
3. I would like to implement this for both cgroups v2 _and_ cgroups
v1, as we still have cgroup v1 users. If this is unacceptable I can
provide the v2 implementation only, and maintain a local patch for the
v1 support.
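
The ENOMEM behavior from caveat 1 can be modeled roughly as follows.
This is a toy Python model, not the experimental implementation: the
Memcg class and remote_charge() are hypothetical names, and it assumes
the case where the oom-killer finds nothing to kill, so a charge that
would exceed the limit simply fails.

```python
import errno
import os


class Memcg:
    """Toy stand-in for a target memcg with a memory.max limit."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # stands in for memory.max
        self.usage = 0              # bytes currently charged

    def remote_charge(self, nbytes):
        # Assumption: nothing in this memcg can be oom-killed, so a
        # charge that would exceed the limit fails with ENOMEM and
        # leaves the usage unchanged.
        if self.usage + nbytes > self.max_bytes:
            raise OSError(errno.ENOMEM, os.strerror(errno.ENOMEM))
        self.usage += nbytes


target = Memcg(max_bytes=8192)
target.remote_charge(4096)
target.remote_charge(4096)  # exactly at the limit, still succeeds

try:
    target.remote_charge(4096)  # would exceed the limit
except OSError as err:
    print(err.errno == errno.ENOMEM, target.usage)  # True 8192
```

The caller executing the remote charge sees the error; the target memcg
is left exactly at its limit.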

If this proposal sounds good in principle, I have an experimental
implementation that I can make ready for review. Please let me know of
any concerns you may have. Thank you very much in advance!
Mina Almasry

[1] https://research.google/pubs/pub48630/




