Re: [RFC Proposal] Deterministic memcg charging for shared memory

Mina Almasry <almasrymina@xxxxxxxxxx> · Wed, 20 Oct 2021 10:46:02 -0700

On Wed, Oct 20, 2021 at 2:09 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Mon 18-10-21 07:31:58, Mina Almasry wrote:
> > On Mon, Oct 18, 2021 at 6:33 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > >
> > > On Wed 13-10-21 12:23:19, Mina Almasry wrote:
> > > > Below is a proposal for deterministic charging of shared memory.
> > > > Please take a look and let me know if there are any major concerns:
> > > >
> > > > Problem:
> > > > Currently shared memory is charged to the memcg of the allocating
> > > > process. This makes memory usage of processes accessing shared memory
> > > > a bit unpredictable since whichever process accesses the memory first
> > > > will get charged. We have a number of use cases where our userspace
> > > > would like deterministic charging of shared memory:
> > > >
> > > > 1. System services allocating memory for client jobs:
> > > > We have services (namely a network access service[1]) that provide
> > > > functionality for clients running on the machine and allocate memory
> > > > to carry out these services. The memory usage of these services
> > > > depends on the number of jobs running on the machine and the nature of
> > > > the requests made to the service, which makes the memory usage of
> > > > these services hard to predict and thus hard to limit via memory.max.
> > > > These system services would like a way to allocate memory and instruct
> > > > the kernel to charge this memory to the client’s memcg.
> > > >
> > > > 2. Shared filesystem between subtasks of a large job
> > > > Our infrastructure has large meta jobs such as kubernetes which spawn
> > > > multiple subtasks which share a tmpfs mount. These jobs and its
> > > > subtasks use that tmpfs mount for various purposes such as data
> > > > sharing or persistent data between the subtask restarts. In kubernetes
> > > > terminology, the meta job is similar to pods and subtasks are
> > > > containers under pods. We want the shared memory to be
> > > > deterministically charged to the kubernetes's pod and independent to
> > > > the lifetime of containers under the pod.
> > > >
> > > > 3. Shared libraries and language runtimes shared between independent jobs.
> > > > We’d like to optimize memory usage on the machine by sharing libraries
> > > > and language runtimes of many of the processes running on our machines
> > > > in separate memcgs. This produces a side effect that one job may be
> > > > unlucky to be the first to access many of the libraries and may get
> > > > oom killed as all the cached files get charged to it.
> > > >
> > > > Design:
> > > > My rough proposal to solve this problem is to simply add a
> > > > ‘memcg=/path/to/memcg’ mount option for filesystems (namely tmpfs):
> > > > directing all the memory of the file system to be ‘remote charged’ to
> > > > cgroup provided by that memcg= option.
> > >
> > > Could you be more specific about how this matches the above mentioned
> > > usecases?
> > >
> >
> > For the use cases I've listed respectively:
> > 1. Our network service would mount a tmpfs with 'memcg=<path to
> > client's memcg>'. Any memory the service is allocating on behalf of
> > the client, the service will allocate inside of this tmpfs mount, thus
> > charging it to the client's memcg without risk of hitting the
> > service's limit.
> > 2. The large job (kubernetes pod) would mount a tmpfs with
> > 'memcg=<path to large job's memcg>. It will then share this tmpfs
> > mount with the subtasks (containers in the pod). The subtasks can then
> > allocate memory in the tmpfs, having it charged to the kubernetes job,
> > without risk of hitting the container's limit.
>
> There is still a risk that the limit is hit for the memcg of shmem
> owner, right? What happens then? Isn't any of the shmem consumer a
> DoS attack vector for everybody else consuming from that same target
> memcg?

This is an interesting point and thanks for bringing it up. I think
there are a couple of things we can do about that:

1. Only allow root to mount a tmpfs with memcg=<anything>
2. Only processes allowed the enter cgroup at mount time can mount a
tmpfs with memcg=<cgroup>

Both address this DoS attack vector adequately I think. I have a
strong preference to solution (2) as it keeps the feature useful for
non-admins while still completely addressing the concern AFAICT.

> In other words aren't all of them in the same memory resource
> domain effectively? If we allow target memcg to live outside of that
> resource domain then this opens interesting questions about resource
> control in general, no? Something the unified hierarchy was aiming to
> fix wrt cgroup v1.
>

In my very humble opinion there are valid reasons for processes inside
the same resource domain to want the memory to be charged to one of
them and not the others, and there are valid reasons for the target
memcg living outside the resource domain without really fundamentally
breaking resource control. So breaking it down by use case:

W.r.t. usecase #1: our network service needs to allocate memory to
provide networking services to any job running on the system,
regardless of which cgroup that job is in. You could say it's
allocating memory to a cgroup outside its resource domain but IMHO
it's a bit of a grey area. After all the network service is servicing
network requests for jobs inside of that resource domain, and
allocating memory on behalf of those jobs and not for its 'own'
purposes in a sense.

W.r.t. usecase #2: the large jobs (kubernetes) and all its sub-tasks
(pods) are all in the same resource domain, but we still would like
the shared memory deterministically charged to the large jobs and
doesn't end up as part of the sub-task usage. I don't have concrete
numbers (I'll work on getting them), but one can image a large job
which owns a 10GB tmpfs mount. The sub-tasks say each use at most
10MB, but they need access to the shared 10GB tmpfs mount. In that
case the sub-tasks will get charged for the 10MB they privately use
and anywhere between 0-10GB for the shared memory. In situations like
these we'd like to enforce that subtasks only use 10MB of 'their'
memory and that the total memory usage of the entire job doesn't use
memory beyond a certain limit, say 10.5GB or something. There is no
way to do that today AFAIK, but with deterministic shared memory
charging this is possible.

> You are saying that it is hard to properly set limits for
> respective services but this would simply allow to hide a part of the
> consumption somewhere else. Aren't you just shifting the problem
> elsewhere? How do configure the target memcg?
>
> Do you have any numbers about the consumption variation and how big of a
> problem that is in practice?
>

So for use case #1 I have concrete numbers. Our network service
roughly allocates around 75MB/VM running on the machine. The exact
number depends on the network activity of the VM and this is just a
rough estimate. There are anywhere between 0-128 VMs running on a the
machine. So the memory usage of the network service can be anywhere
between almost 0 to 9.6GB. Statically deciding the memory limit of the
network service's cgroup is not possible. Increasing/decreasing the
limit when new VMs are scheduled on the machine is also not very
workable, as the amount of memory the service uses depends on the
network activity of the VM. Getting the limit wrong has disastrous
consequences as the service is unable to serve the entire machine.
With memcg= this becomes a trivial problem as each VM pads its memory
usage with extra headroom for network activity. If the headroom is
insufficient only that one VM is deprived of network services, not the
entire machine.

W.r.t usecase #2, I don't have concrete numbers, but I've outlined
above one (hopefully) reasonable scenario where we'd have an issue.

> > 3. We would need to extend this functionality to other file systems of
> > persistent disk, then mount that file system with 'memcg=<dedicated
> > shared library memcg>'. Jobs can then use the shared library and any
> > memory allocated due to loading the shared library is charged to a
> > dedicated memcg, and not charged to the job using the shared library.
>
> This is more of a question for fs people. My understanding is rather
> limited so I cannot even imagine all the possible setups but just from
> a very high level understanding bind mounts can get really interesting.
> Can those disagree on the memcg?
>

I will admit I've thought about the tmpfs support as much as possible,
and that extending the support to other file system will generate more
concerns, maybe even concerns specific to that file system. I'm hoping
to build support for tmpfs and leave the implementation generic enough
to be extended afterwards.

I haven't thought of it thoroughly but I would imagine bind mounts
shouldn't be able to disagree on the memcg. At least right now I can't
think of a compelling use case for that.

> I am pretty sure I didn't get to think through this very deeply, my gut
> feeling tells me that this will open many interesting questions and I am
> not sure whether it solves more problems than it introduces at this moment.
> I would be really curious what others think about this.

I would also obviously love to hear opinions of others (and thanks
Michal for taking the time to review). I think I'll work on posting a
patch series to the list and hope that solicits more opinions.

> --
> Michal Hocko
> SUSE Labs