On Tue, Jul 12, 2022 at 11:11 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> Ccing Mina who actually worked on upstreaming this. See [1] for
> previous discussion and more use-cases.
>
> [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@xxxxxxxxxx/
>
> On Tue, Jul 12, 2022 at 10:36 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
> >
> > Hello,
> >
> > On Tue, Jul 12, 2022 at 10:26:22AM -0700, Shakeel Butt wrote:
> > > One use-case we have is a build & test service which runs independent
> > > builds and tests but all the build utilities (compiler, linker,
> > > libraries) are shared between those builds and tests.
> > >
> > > In terms of topology, the service has a top level cgroup (P) and all
> > > independent builds and tests run in their own cgroup under P. These
> > > builds/tests continuously come and go.
> > >
> > > This service continuously monitors all the builds/tests running and
> > > may kill some based on some criteria which includes memory usage.
> > > However the memory usage is nondeterministic and killing a specific
> > > build/test may not really free memory if most of the memory charged to
> > > it is from shared build utilities.
> >
> > That doesn't sound too unusual. So, one saving grace here is that the memory
> > pressure in the stressed cgroup should trigger reclaim of the shared memory
> > which will be likely picked up by someone else, hopefully, under less memory
> > pressure. Can you give more concrete details? ie. describe a failing
> > scenario with actual ballpark memory numbers?
>
> Mina, can you please provide details requested by Tejun?
>

As far as I am aware, the builds/tests service Shakeel mentioned is a
theoretical use case we're considering, but the actual use cases we're
running are the 3 I listed in the cover letter of my original proposal:

https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@xxxxxxxxxx/

Still, the use case Shakeel is talking about is almost identical to use
case #2 in that proposal:

"Our infrastructure has large meta jobs such as kubernetes which spawn
multiple subtasks which share a tmpfs mount. These jobs and its
subtasks use that tmpfs mount for various purposes such as data sharing
or persistent data between the subtask restarts. In kubernetes
terminology, the meta job is similar to pods and subtasks are
containers under pods. We want the shared memory to be
deterministically charged to the kubernetes's pod and independent to
the lifetime of containers under the pod."

To run such a job we do the following (a rough shell sketch of this
setup follows below):

- We set up a hierarchy like so:

                 pod_container
                /      |      \
     container_a  container_b  container_c

- We set up a tmpfs mount with memcg=pod_container. This instructs the
  kernel to charge all of this tmpfs's user data to pod_container,
  instead of to the memcg of the task which faults in the shared
  memory.

- We set up pod_container.max to be the maximum amount of memory
  allowed for the _entire_ job.

- We set up container_a.max, container_b.max, and container_c.max to be
  the limits of sub-tasks a, b, and c respectively, not including the
  shared memory, which is allocated via the tmpfs mount and charged
  directly to pod_container.

For some rough numbers, you can imagine a scenario like:

    tmpfs memcg=pod_container,size=100MB

                   pod_container.max=130MB
                  /          |           \
    container_a.max=10MB  container_b.max=20MB  container_c.max=30MB

Thanks to memcg=pod_container, none of tasks a, b, and c is charged for
the shared memory, so they can stay within their 10MB, 20MB, and 30MB
limits respectively.
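To make that concrete, here is a rough sketch of the above setup in
shell form. The cgroup paths, mount point, and sizes are illustrative;
memory.max and cgroup.subtree_control are the standard cgroup v2
interface files, while the memcg= tmpfs mount option (and its exact
syntax here) is only the interface proposed in the series linked above,
not something in mainline:

    # Pod-level cgroup with one child cgroup per sub-task (paths are
    # made up for illustration; assumes the memory controller is
    # already enabled at the root).
    mkdir /sys/fs/cgroup/pod_container
    echo +memory > /sys/fs/cgroup/pod_container/cgroup.subtree_control
    mkdir /sys/fs/cgroup/pod_container/container_a
    mkdir /sys/fs/cgroup/pod_container/container_b
    mkdir /sys/fs/cgroup/pod_container/container_c

    # Limit for the _entire_ job, including the shared tmpfs data.
    echo 130M > /sys/fs/cgroup/pod_container/memory.max

    # Per-sub-task limits, which do not need to include the shared data.
    echo 10M > /sys/fs/cgroup/pod_container/container_a/memory.max
    echo 20M > /sys/fs/cgroup/pod_container/container_b/memory.max
    echo 30M > /sys/fs/cgroup/pod_container/container_c/memory.max

    # Proposed interface: charge all tmpfs pages to pod_container, no
    # matter which sub-task faults them in (memcg= is not upstream; the
    # option value shown is an assumption based on the proposal).
    mount -t tmpfs -o size=100M,memcg=/sys/fs/cgroup/pod_container \
          tmpfs /mnt/shared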
This gives us fine-grained control: we can deterministically charge the
shared memory, and we can apply limits both to the memory usage of the
individual sub-tasks and to the overall amount of memory the entire pod
should consume.

For transparency's sake, these are Johannes's comments on the API:

https://lore.kernel.org/linux-mm/YZvppKvUPTIytM%2Fc@xxxxxxxxxxx/

As Tejun puts it: "it may make sense to have a way to escape certain
resources to an ancestor for shared resources provided that we can come
up with a sane interface"

The interface Johannes has opted for is to reparent memory to the
common ancestor _when it is accessed by a task in another memcg_. This
doesn't work for us for a few reasons, one of which is that in the
example above container_a may get charged for all 100MB of the shared
memory if it happens to be the unlucky task that faults it all in.

> >
> > FWIW, at least from generic resource control standpoint, I think it may
> > make sense to have a way to escape certain resources to an ancestor for
> > shared resources provided that we can come up with a sane interface.
> >
> > Thanks.
> >
> > --
> > tejun
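An explicit way to escape the shared charges to an ancestor is exactly
what memcg= is meant to provide. For reference, here is a rough way to
observe the resulting accounting, reusing the illustrative paths from
the sketch above (memory.current and cgroup.procs are the standard
cgroup v2 files; the memcg= mount option itself is only the proposed
interface):

    # Run a task in container_a and have it fault in all the shared data.
    echo $$ > /sys/fs/cgroup/pod_container/container_a/cgroup.procs
    dd if=/dev/zero of=/mnt/shared/blob bs=1M count=100

    # With memcg=pod_container, the ~100MB is charged to the pod:
    cat /sys/fs/cgroup/pod_container/memory.current
    # ...while container_a stays well under its 10MB limit:
    cat /sys/fs/cgroup/pod_container/container_a/memory.current

    # Without memcg= (or with charge-at-fault plus reparent-on-access
    # semantics), the unlucky faulting task in container_a would take
    # the initial 100MB charge and hit its 10MB limit.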