On Wed, Oct 20, 2021 at 2:09 AM Michal Hocko <mhocko@xxxxxxxx> wrote: > > On Mon 18-10-21 07:31:58, Mina Almasry wrote: > > On Mon, Oct 18, 2021 at 6:33 AM Michal Hocko <mhocko@xxxxxxxx> wrote: > > > > > > On Wed 13-10-21 12:23:19, Mina Almasry wrote: > > > > Below is a proposal for deterministic charging of shared memory. > > > > Please take a look and let me know if there are any major concerns: > > > > > > > > Problem: > > > > Currently shared memory is charged to the memcg of the allocating > > > > process. This makes memory usage of processes accessing shared memory > > > > a bit unpredictable since whichever process accesses the memory first > > > > will get charged. We have a number of use cases where our userspace > > > > would like deterministic charging of shared memory: > > > > > > > > 1. System services allocating memory for client jobs: > > > > We have services (namely a network access service[1]) that provide > > > > functionality for clients running on the machine and allocate memory > > > > to carry out these services. The memory usage of these services > > > > depends on the number of jobs running on the machine and the nature of > > > > the requests made to the service, which makes the memory usage of > > > > these services hard to predict and thus hard to limit via memory.max. > > > > These system services would like a way to allocate memory and instruct > > > > the kernel to charge this memory to the client’s memcg. > > > > > > > > 2. Shared filesystem between subtasks of a large job > > > > Our infrastructure has large meta jobs such as kubernetes which spawn > > > > multiple subtasks which share a tmpfs mount. These jobs and its > > > > subtasks use that tmpfs mount for various purposes such as data > > > > sharing or persistent data between the subtask restarts. In kubernetes > > > > terminology, the meta job is similar to pods and subtasks are > > > > containers under pods. We want the shared memory to be > > > > deterministically charged to the kubernetes's pod and independent to > > > > the lifetime of containers under the pod. > > > > > > > > 3. Shared libraries and language runtimes shared between independent jobs. > > > > We’d like to optimize memory usage on the machine by sharing libraries > > > > and language runtimes of many of the processes running on our machines > > > > in separate memcgs. This produces a side effect that one job may be > > > > unlucky to be the first to access many of the libraries and may get > > > > oom killed as all the cached files get charged to it. > > > > > > > > Design: > > > > My rough proposal to solve this problem is to simply add a > > > > ‘memcg=/path/to/memcg’ mount option for filesystems (namely tmpfs): > > > > directing all the memory of the file system to be ‘remote charged’ to > > > > cgroup provided by that memcg= option. > > > > > > Could you be more specific about how this matches the above mentioned > > > usecases? > > > > > > > For the use cases I've listed respectively: > > 1. Our network service would mount a tmpfs with 'memcg=<path to > > client's memcg>'. Any memory the service is allocating on behalf of > > the client, the service will allocate inside of this tmpfs mount, thus > > charging it to the client's memcg without risk of hitting the > > service's limit. > > 2. The large job (kubernetes pod) would mount a tmpfs with > > 'memcg=<path to large job's memcg>. It will then share this tmpfs > > mount with the subtasks (containers in the pod). The subtasks can then > > allocate memory in the tmpfs, having it charged to the kubernetes job, > > without risk of hitting the container's limit. > > There is still a risk that the limit is hit for the memcg of shmem > owner, right? What happens then? Isn't any of the shmem consumer a > DoS attack vector for everybody else consuming from that same target > memcg? This is an interesting point and thanks for bringing it up. I think there are a couple of things we can do about that: 1. Only allow root to mount a tmpfs with memcg=<anything> 2. Only processes allowed the enter cgroup at mount time can mount a tmpfs with memcg=<cgroup> Both address this DoS attack vector adequately I think. I have a strong preference to solution (2) as it keeps the feature useful for non-admins while still completely addressing the concern AFAICT. > In other words aren't all of them in the same memory resource > domain effectively? If we allow target memcg to live outside of that > resource domain then this opens interesting questions about resource > control in general, no? Something the unified hierarchy was aiming to > fix wrt cgroup v1. > In my very humble opinion there are valid reasons for processes inside the same resource domain to want the memory to be charged to one of them and not the others, and there are valid reasons for the target memcg living outside the resource domain without really fundamentally breaking resource control. So breaking it down by use case: W.r.t. usecase #1: our network service needs to allocate memory to provide networking services to any job running on the system, regardless of which cgroup that job is in. You could say it's allocating memory to a cgroup outside its resource domain but IMHO it's a bit of a grey area. After all the network service is servicing network requests for jobs inside of that resource domain, and allocating memory on behalf of those jobs and not for its 'own' purposes in a sense. W.r.t. usecase #2: the large jobs (kubernetes) and all its sub-tasks (pods) are all in the same resource domain, but we still would like the shared memory deterministically charged to the large jobs and doesn't end up as part of the sub-task usage. I don't have concrete numbers (I'll work on getting them), but one can image a large job which owns a 10GB tmpfs mount. The sub-tasks say each use at most 10MB, but they need access to the shared 10GB tmpfs mount. In that case the sub-tasks will get charged for the 10MB they privately use and anywhere between 0-10GB for the shared memory. In situations like these we'd like to enforce that subtasks only use 10MB of 'their' memory and that the total memory usage of the entire job doesn't use memory beyond a certain limit, say 10.5GB or something. There is no way to do that today AFAIK, but with deterministic shared memory charging this is possible. > You are saying that it is hard to properly set limits for > respective services but this would simply allow to hide a part of the > consumption somewhere else. Aren't you just shifting the problem > elsewhere? How do configure the target memcg? > > Do you have any numbers about the consumption variation and how big of a > problem that is in practice? > So for use case #1 I have concrete numbers. Our network service roughly allocates around 75MB/VM running on the machine. The exact number depends on the network activity of the VM and this is just a rough estimate. There are anywhere between 0-128 VMs running on a the machine. So the memory usage of the network service can be anywhere between almost 0 to 9.6GB. Statically deciding the memory limit of the network service's cgroup is not possible. Increasing/decreasing the limit when new VMs are scheduled on the machine is also not very workable, as the amount of memory the service uses depends on the network activity of the VM. Getting the limit wrong has disastrous consequences as the service is unable to serve the entire machine. With memcg= this becomes a trivial problem as each VM pads its memory usage with extra headroom for network activity. If the headroom is insufficient only that one VM is deprived of network services, not the entire machine. W.r.t usecase #2, I don't have concrete numbers, but I've outlined above one (hopefully) reasonable scenario where we'd have an issue. > > 3. We would need to extend this functionality to other file systems of > > persistent disk, then mount that file system with 'memcg=<dedicated > > shared library memcg>'. Jobs can then use the shared library and any > > memory allocated due to loading the shared library is charged to a > > dedicated memcg, and not charged to the job using the shared library. > > This is more of a question for fs people. My understanding is rather > limited so I cannot even imagine all the possible setups but just from > a very high level understanding bind mounts can get really interesting. > Can those disagree on the memcg? > I will admit I've thought about the tmpfs support as much as possible, and that extending the support to other file system will generate more concerns, maybe even concerns specific to that file system. I'm hoping to build support for tmpfs and leave the implementation generic enough to be extended afterwards. I haven't thought of it thoroughly but I would imagine bind mounts shouldn't be able to disagree on the memcg. At least right now I can't think of a compelling use case for that. > I am pretty sure I didn't get to think through this very deeply, my gut > feeling tells me that this will open many interesting questions and I am > not sure whether it solves more problems than it introduces at this moment. > I would be really curious what others think about this. I would also obviously love to hear opinions of others (and thanks Michal for taking the time to review). I think I'll work on posting a patch series to the list and hope that solicits more opinions. > -- > Michal Hocko > SUSE Labs