On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote:
> On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > Problem:
> > Currently shared memory is charged to the memcg of the allocating
> > process. This makes memory usage of processes accessing shared
> > memory a bit unpredictable since whichever process accesses the
> > memory first will get charged. We have a number of use cases where
> > our userspace would like deterministic charging of shared memory:
> >
> > 1. System services allocating memory for client jobs:
> > We have services (namely a network access service[1]) that provide
> > functionality for clients running on the machine and allocate memory
> > to carry out these services. The memory usage of these services
> > depends on the number of jobs running on the machine and the nature
> > of the requests made to the service, which makes the memory usage of
> > these services hard to predict and thus hard to limit via
> > memory.max. These system services would like a way to allocate
> > memory and instruct the kernel to charge this memory to the client's
> > memcg.
> >
> > 2. Shared filesystem between subtasks of a large job
> > Our infrastructure has large meta jobs such as kubernetes which
> > spawn multiple subtasks which share a tmpfs mount. These jobs and
> > their subtasks use that tmpfs mount for various purposes such as
> > data sharing or persistent data between subtask restarts. In
> > kubernetes terminology, the meta job is similar to a pod and the
> > subtasks are containers under the pod. We want the shared memory to
> > be deterministically charged to the kubernetes pod and independent
> > of the lifetime of containers under the pod.
> >
> > 3. Shared libraries and language runtimes shared between independent jobs.
> > We'd like to optimize memory usage on the machine by sharing
> > libraries and language runtimes of many of the processes running on
> > our machines in separate memcgs. This produces a side effect that
> > one job may be unlucky enough to be the first to access many of the
> > libraries and may get oom killed as all the cached files get charged
> > to it.
> >
> > Design:
> > My rough proposal to solve this problem is to simply add a
> > 'memcg=/path/to/memcg' mount option for filesystems, directing all
> > the memory of the filesystem to be 'remote charged' to the cgroup
> > provided by that memcg= option.
> >
> > Caveats:
> >
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case
> > the oom-killer will be invoked, but the oom-killer may not find
> > anything to kill in the target memcg being charged. There are a
> > number of considerations in this case:
> >
> > 1. It's not great to kill the allocating process since the
> > allocating process is not running in the memcg under oom, and
> > killing it will not free memory in the memcg under oom.
> > 2. Pagefaults may hit the memcg limit, and we need to handle the
> > pagefault somehow. If not, the process will loop on the pagefault
> > forever in the upstream kernel.
> >
> > In this case, I propose simply failing the remote charge and
> > returning ENOSPC to the caller. This will cause the process
> > executing the remote charge to get an ENOSPC in non-pagefault paths,
> > and a SIGBUS on the pagefault path. This will be documented behavior
> > of remote charging, and this feature is opt-in. Users can:
> > - Not opt into the feature if they want.
> > - Opt into the feature and accept the risk of receiving ENOSPC or
> > SIGBUS and abort if they desire.
> > - Gracefully handle any resulting ENOSPC or SIGBUS errors and
> > continue their operation without executing the remote charge if
> > possible.
> >
> > 2. Only processes allowed to enter the cgroup at mount time can
> > mount a tmpfs with memcg=<cgroup>. This is to prevent intentional
> > DoS of random cgroups on the machine. However, once a filesystem is
> > mounted with memcg=<cgroup>, any process with write access to this
> > mount point will be able to charge memory to <cgroup>. This is
> > largely a non-issue because in configurations where there is
> > untrusted code running on the machine, mount point access needs to
> > be restricted to the intended users only, regardless of whether the
> > mount point memory is deterministically charged or not.
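For illustration, usage of the proposed interface and handling of the
documented ENOSPC/SIGBUS behavior could look roughly like the sketch
below. This is a userspace sketch against the proposal as described,
not tested code: the memcg= option only exists with this patchset
applied, and the mount point, cgroup path and file name are made up.

#include <errno.h>
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <unistd.h>

static sigjmp_buf fault_env;

static void sigbus_handler(int sig)
{
	/* Remote charge failed during a page fault: unwind and give up. */
	siglongjmp(fault_env, 1);
}

int main(void)
{
	/* Charge everything in this tmpfs to the client's memcg. */
	if (mount("none", "/mnt/client-shm", "tmpfs", 0,
		  "memcg=/sys/fs/cgroup/client1") < 0) {
		perror("mount");
		return 1;
	}

	struct sigaction sa = { .sa_handler = sigbus_handler };
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);

	int fd = open("/mnt/client-shm/data", O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return 1;

	/*
	 * Non-pagefault path: write() fails with ENOSPC if the target
	 * memcg is already at its memory.max.
	 */
	if (write(fd, "hello", 5) < 0 && errno == ENOSPC)
		fprintf(stderr, "remote charge failed, skipping\n");

	/* Pagefault path: a store into a mapping raises SIGBUS instead. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p != MAP_FAILED) {
		if (sigsetjmp(fault_env, 1) == 0)
			p[0] = 1;
		else
			fprintf(stderr, "remote charge failed on fault\n");
	}
	return 0;
}

A service that cannot make progress without the allocation would
instead treat these errors as fatal and abort, per the options listed
above.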
> I'm not a fan of this. It uses filesystem mounts to create shareable
> resource domains outside of the cgroup hierarchy, which has all the
> downsides you listed, and more:
>
> 1. You need a filesystem interface in the first place, and a new
>    ad-hoc channel and permission model to coordinate with the cgroup
>    tree, which isn't great. All filesystems you want to share data on
>    need to be converted.
>
> 2. It doesn't extend to non-filesystem sources of shared data, such
>    as memfds, ipc shm etc.
>
> 3. It requires unintuitive configuration for what should be basic
>    shared accounting semantics. Per default you still get the old
>    'first touch' semantics, but to get sharing you need to
>    reconfigure the filesystems?
>
> 4. If a task needs to work with a hierarchy of data sharing domains -
>    system-wide, group of jobs, job - it must interact with a
>    hierarchy of filesystem mounts. This is a pain to set up and may
>    require task awareness. Moving data around, working with different
>    mount points. Also, no shared and private data accounting within
>    the same file.
>
> 5. It reintroduces cgroup1 semantics of tasks and resources, which
>    are entangled, sitting in disjunct domains. OOM killing is one
>    quirk of that, but there are others you haven't touched on. Who is
>    charged for the CPU cycles of reclaim in the out-of-band domain?
>    Who is charged for the paging IO? How is resource pressure
>    accounted and attributed? Soon you need cpu= and io= as well.
>
> My take on this is that it might work for your rather specific
> usecase, but it doesn't strike me as a general-purpose feature
> suitable for upstream.
>
> If we want sharing semantics for memory, I think we need a more
> generic implementation with a cleaner interface.
>
> Here is one idea:
>
> Have you considered reparenting pages that are accessed by multiple
> cgroups to the first common ancestor of those groups?
>
> Essentially, whenever there is a memory access (minor fault, buffered
> IO) to a page that doesn't belong to the accessing task's cgroup, you
> find the common ancestor between that task and the owning cgroup, and
> move the page there.
>
> With a tree like this:
>
>     root - job group - job
>                      `- job
>          `- job group - job
>                       `- job
>
> all pages accessed inside that tree will propagate to the highest
> level at which they are shared - which is the same level where you'd
> also set shared policies, like a job group memory limit or io weight.
>
> E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> pages would bubble to the respective job group, private data would
> stay within each job.
>
> No further user configuration necessary. Although you still *can* use
> mount namespacing etc. to prohibit undesired sharing between cgroups.
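To make the proposed mechanism concrete, below is a small,
self-contained toy model of the common-ancestor reparenting idea in
plain userspace C. It is not kernel code: the cgroup tree, the page
struct and the helper names are all made up for illustration.

/*
 * Toy model of "reparent shared pages to the first common ancestor".
 * Cgroups are modelled as a simple tree of structs and a "page" just
 * remembers which node its charge currently belongs to.
 */
#include <stdio.h>

struct cgroup {
	const char *name;
	struct cgroup *parent;
	long charged;           /* pages charged directly to this node */
};

struct page {
	struct cgroup *memcg;   /* current owner of the charge */
};

/* Depth of a node in the hierarchy (root has depth 0). */
static int depth(struct cgroup *cg)
{
	int d = 0;
	for (; cg->parent; cg = cg->parent)
		d++;
	return d;
}

/* First common ancestor of two cgroups. */
static struct cgroup *common_ancestor(struct cgroup *a, struct cgroup *b)
{
	while (depth(a) > depth(b))
		a = a->parent;
	while (depth(b) > depth(a))
		b = b->parent;
	while (a != b) {
		a = a->parent;
		b = b->parent;
	}
	return a;
}

/*
 * Called on any access (fault, buffered IO) by @accessor to @page.
 * If the page already belongs to the accessor's cgroup or one of its
 * ancestors, nothing changes; otherwise the charge moves up to the
 * first common ancestor.
 */
static void account_access(struct page *page, struct cgroup *accessor)
{
	struct cgroup *target = common_ancestor(page->memcg, accessor);

	if (target == page->memcg)
		return;
	page->memcg->charged--;
	target->charged++;
	page->memcg = target;
}

int main(void)
{
	struct cgroup root  = { "root", NULL, 0 };
	struct cgroup group = { "job group", &root, 0 };
	struct cgroup job1  = { "job1", &group, 0 };
	struct cgroup job2  = { "job2", &group, 0 };

	struct page shared = { &job1 };
	job1.charged++;                 /* job1 touched the page first */

	account_access(&shared, &job2); /* job2 touches the same page */
	printf("page now charged to: %s\n", shared.memcg->name);
	return 0;
}

Running it moves the shared page's charge to "job group" once both
jobs have touched it, matching the tree example above; a page already
charged to an ancestor of the accessor stays where it is.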
> The actual user-visible accounting change would be quite small, and
> arguably much more intuitive. Remember that accounting is recursive,
> meaning that a job page today also shows up in the counters of job
> group and root. This would not change. The only thing that IS weird
> today is that when two jobs share a page, it will arbitrarily show up
> in one job's counter but not in the other's. That would change: it
> would no longer show up as either, since it's not private to either;
> it would just be a job group (and up) page.

In general I like the idea, but I think the user-visible change will
be quite large, almost "cgroup v3"-large. Here are some problems:

1) Anything shared between e.g. system.slice and user.slice now
belongs to the root cgroup and is completely unaccounted/unlimited.
E.g. all pagecache belonging to shared libraries.

2) It's concerning in security terms. If I understand the idea
correctly, read-only access will allow moving charges to an upper
level, potentially crossing memory.max limits. It doesn't sound safe.

3) It brings a non-trivial amount of memory to non-leaf cgroups. To
some extent it returns us to the cgroup v1 world and the question of
competition between resources consumed by a cgroup directly and
through child cgroups. It's not like the problem doesn't exist now,
but it's less pronounced. If, say, >50% of system.slice's memory will
belong to system.slice directly, then we'll likely need separate
non-recursive counters, limits, protections, etc.

4) Imagine a production server and a system administrator logging in
via ssh (and being put into user.slice) and running a big grep... It
screws up all memory accounting until the next reboot. Not a
completely impossible scenario.

That said, I agree with Johannes and I'm also not a big fan of this
patchset. I agree that the problem exists and that the patchset
provides a solution, but it doesn't look nice (or generic enough) and
creates a lot of questions and corner cases.

Btw, won't (an optional) disabling of memcg accounting for a tmpfs
solve your problem? It will be less invasive and will not require any
oom changes.