On Tue, Nov 23, 2021 at 01:19:47PM -0800, Mina Almasry wrote:
> On Tue, Nov 23, 2021 at 12:21 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > On Mon, Nov 22, 2021 at 03:09:26PM -0800, Roman Gushchin wrote:
> > > On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote:
> > > > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > > > > Problem:
> > > > > Currently shared memory is charged to the memcg of the allocating process. This makes the memory usage of processes accessing shared memory a bit unpredictable, since whichever process accesses the memory first will get charged. We have a number of use cases where our userspace would like deterministic charging of shared memory:
> > > > > 1. System services allocating memory for client jobs:
> > > > > We have services (namely a network access service[1]) that provide functionality for clients running on the machine and allocate memory to carry out these services. The memory usage of these services depends on the number of jobs running on the machine and the nature of the requests made to the service, which makes the memory usage of these services hard to predict and thus hard to limit via memory.max. These system services would like a way to allocate memory and instruct the kernel to charge this memory to the client’s memcg.
> > > > > 2. Shared filesystem between subtasks of a large job:
> > > > > Our infrastructure has large meta jobs, such as kubernetes, which spawn multiple subtasks that share a tmpfs mount. These jobs and their subtasks use that tmpfs mount for various purposes, such as data sharing or persisting data across subtask restarts. In kubernetes terminology, the meta job is similar to a pod and the subtasks are containers under the pod. We want the shared memory to be deterministically charged to the kubernetes pod and independent of the lifetime of the containers under the pod.
> > > > > 3. Shared libraries and language runtimes shared between independent jobs:
> > > > > We’d like to optimize memory usage on the machine by sharing libraries and language runtimes of many of the processes running on our machines in separate memcgs. This produces a side effect that one job may be unlucky enough to be the first to access many of the libraries and may get oom killed as all the cached files get charged to it.
> > > > > Design:
> > > > > My rough proposal to solve this problem is to simply add a ‘memcg=/path/to/memcg’ mount option for filesystems, directing all the memory of the filesystem to be ‘remote charged’ to the cgroup provided by that memcg= option.
> > > > > Caveats:
> > > > > 1. One complication to address is the behavior when the target memcg hits its memory.max limit because of remote charging. In this case the oom-killer will be invoked, but the oom-killer may not find anything to kill in the target memcg being charged. There are a number of considerations in this case:
> > > > >    1. It's not great to kill the allocating process, since the allocating process is not running in the memcg under oom, and killing it will not free memory in the memcg under oom.
> > > > >    2. Pagefaults may hit the memcg limit, and we need to handle the pagefault somehow. If not, the process will loop on the pagefault forever in the upstream kernel.
> > > > > In this case, I propose simply failing the remote charge and returning an ENOSPC to the caller. This will cause the process executing the remote charge to get an ENOSPC in non-pagefault paths, and a SIGBUS on the pagefault path. This will be documented behavior of remote charging, and the feature is opt-in. Users can:
> > > > > - Not opt into the feature if they want.
> > > > > - Opt into the feature, accept the risk of receiving ENOSPC or SIGBUS, and abort if they desire.
> > > > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their operation without executing the remote charge if possible.
> > > > > 2. Only processes allowed to enter the cgroup at mount time can mount a tmpfs with memcg=<cgroup>. This is to prevent intentional DoS of random cgroups on the machine. However, once a filesystem is mounted with memcg=<cgroup>, any process with write access to that mount point will be able to charge memory to <cgroup>. This is largely a non-issue because in configurations where there is untrusted code running on the machine, mount point access needs to be restricted to the intended users only, regardless of whether the mount point memory is deterministically charged or not.
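To make the proposal above concrete, here is a minimal userspace sketch of how a service might opt in and handle the documented failure modes. This is illustrative only: the memcg= mount option is just the proposal quoted above and does not exist in the upstream kernel, and the mount point, cgroup path and file name are made-up examples.

/*
 * Sketch: mount a tmpfs whose memory is remote-charged to a job's memcg
 * (proposed, non-upstream memcg= option), and handle the two documented
 * failure modes: ENOSPC on non-pagefault paths, SIGBUS on pagefaults.
 */
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

static void sigbus_handler(int sig)
{
    static const char msg[] = "remote charge failed (SIGBUS), aborting\n";

    (void)sig;
    /* A pagefault could not be charged to the target memcg; bail out. */
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(EXIT_FAILURE);
}

int main(void)
{
    struct sigaction sa;
    char buf[4096] = { 0 };
    int fd;

    /* Faults on the shared mount may raise SIGBUS if the remote charge fails. */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigbus_handler;
    sigaction(SIGBUS, &sa, NULL);

    /* Charge all memory of this tmpfs to the job's memcg (proposed syntax). */
    if (mount("tmpfs", "/mnt/shared", "tmpfs", 0,
              "size=64m,memcg=/sys/fs/cgroup/jobs/job1"))
        err(1, "mount");

    fd = open("/mnt/shared/scratch", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        err(1, "open");

    /* Non-pagefault paths report a failed remote charge as ENOSPC. */
    if (write(fd, buf, sizeof(buf)) < 0 && errno == ENOSPC)
        fprintf(stderr, "remote charge failed (ENOSPC), continuing without the cache\n");

    close(fd);
    return 0;
}

On a current kernel the mount() call simply fails, since memcg= is unknown; the sketch only shows what the opt-in and the error handling would look like if the proposal were merged.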
> > > > I'm not a fan of this. It uses filesystem mounts to create shareable resource domains outside of the cgroup hierarchy, which has all the downsides you listed, and more:
> > > > 1. You need a filesystem interface in the first place, and a new ad-hoc channel and permission model to coordinate with the cgroup tree, which isn't great. All filesystems you want to share data on need to be converted.
> > > > 2. It doesn't extend to non-filesystem sources of shared data, such as memfds, ipc shm etc.
> > > > 3. It requires unintuitive configuration for what should be basic shared accounting semantics. Per default you still get the old 'first touch' semantics, but to get sharing you need to reconfigure the filesystems?
> > > > 4. If a task needs to work with a hierarchy of data sharing domains - system-wide, group of jobs, job - it must interact with a hierarchy of filesystem mounts. This is a pain to set up and may require task awareness: moving data around, working with different mount points. Also, no shared and private data accounting within the same file.
> > > > 5. It reintroduces cgroup1 semantics of tasks and resources, which are entangled, sitting in disjoint domains. OOM killing is one quirk of that, but there are others you haven't touched on. Who is charged for the CPU cycles of reclaim in the out-of-band domain? Who is charged for the paging IO? How is resource pressure accounted and attributed? Soon you need cpu= and io= as well.
> > > > My take on this is that it might work for your rather specific usecase, but it doesn't strike me as a general-purpose feature suitable for upstream.
> > > > If we want sharing semantics for memory, I think we need a more generic implementation with a cleaner interface.
> > > > Here is one idea:
> > > > Have you considered reparenting pages that are accessed by multiple cgroups to the first common ancestor of those groups?
> > > > Essentially, whenever there is a memory access (minor fault, buffered IO) to a page that doesn't belong to the accessing task's cgroup, you find the common ancestor between that task and the owning cgroup, and move the page there.
> > > > With a tree like this:
> > > >   root - job group - job
> > > >                   `- job
> > > >        `- job group - job
> > > >                    `- job
> > > > all pages accessed inside that tree will propagate to the highest level at which they are shared - which is the same level where you'd also set shared policies, like a job group memory limit or io weight.
> > > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs pages would bubble to the respective job group, and private data would stay within each job.
> > > > No further user configuration necessary. Although you still *can* use mount namespacing etc. to prohibit undesired sharing between cgroups.
> > > > The actual user-visible accounting change would be quite small, and arguably much more intuitive. Remember that accounting is recursive, meaning that a job page today also shows up in the counters of the job group and the root. This would not change. The only thing that IS weird today is that when two jobs share a page, it will arbitrarily show up in one job's counter but not in the other's. That would change: it would no longer show up in either, since it's not private to either; it would just be a job group (and up) page.
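As an illustration of the accounting semantics described above, here is a small, self-contained userspace model. This is not kernel code; the job names and the single shared page are made up. One job faults a page in, a sibling job then touches it, and the charge moves to their first common ancestor, while the recursive usage reported at the job group and root levels stays the same.

/*
 * Model of "reparent shared pages to the first common ancestor":
 * a page charged to job a is re-charged to the job group once job b
 * touches it. Recursive counters above the sharing level are unchanged.
 */
#include <stdio.h>

struct cg {
    const char *name;
    struct cg *parent;
    long pages;                     /* pages charged directly to this group */
};

static int depth(struct cg *c)
{
    int d = 0;

    while (c->parent) {
        c = c->parent;
        d++;
    }
    return d;
}

/* First common ancestor of two groups in the same tree. */
static struct cg *common_ancestor(struct cg *a, struct cg *b)
{
    while (depth(a) > depth(b))
        a = a->parent;
    while (depth(b) > depth(a))
        b = b->parent;
    while (a != b) {
        a = a->parent;
        b = b->parent;
    }
    return a;
}

/* A task in 'accessor' touches a page currently charged to 'owner'. */
static void share_page(struct cg *owner, struct cg *accessor)
{
    struct cg *target = common_ancestor(owner, accessor);

    if (target != owner) {
        owner->pages--;             /* uncharge the current owner ...     */
        target->pages++;            /* ... and charge the shared ancestor */
    }
}

/* Recursive usage as cgroup v2 reports it: self plus all descendants. */
static long usage(struct cg *c, struct cg **all, int n)
{
    long sum = 0;

    for (int i = 0; i < n; i++)
        for (struct cg *p = all[i]; p; p = p->parent)
            if (p == c) {
                sum += all[i]->pages;
                break;
            }
    return sum;
}

int main(void)
{
    struct cg root = { "root", NULL, 0 };
    struct cg grp  = { "job group", &root, 0 };
    struct cg a    = { "job a", &grp, 1 };      /* job a faulted the page in */
    struct cg b    = { "job b", &grp, 0 };
    struct cg *all[] = { &root, &grp, &a, &b };

    share_page(&a, &b);                         /* job b accesses the same page */

    for (int i = 0; i < 4; i++)
        printf("%-9s: self=%ld recursive=%ld\n",
               all[i]->name, all[i]->pages, usage(all[i], all, 4));
    return 0;
}

After the shared access, the page shows up as the job group's own memory and in neither job's counter, while the recursive totals of the job group and the root are unchanged at one page.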
> > These are great questions.
> > > In general I like the idea, but I think the user-visible change will be quite large, almost "cgroup v3"-large.
> > I wouldn't quite say cgroup3 :-) But it would definitely require a new mount option for cgroupfs.
> > > Here are some problems:
> > > 1) Anything shared between e.g. system.slice and user.slice now belongs to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache belonging to shared libraries.
> > Correct, but arguably that's a good thing:
> > Right now, even though the libraries are used by both, they'll be held by one group. This can cause two priority inversions: hipri references don't prevent the shared page from thrashing inside a lowpri group, which could subject the hipri group to reclaim pressure and to waiting for slow refaults in the lowpri group; if the lowpri group is the hotter user of this page, this can persist. Or the page ends up in the hipri group, and the lowpri group pins it there even when the hipri group is done with it, thus stealing its capacity.
> > Yes, a libc page used by everybody in the system would end up in the root cgroup. But arguably that makes much more sense than having it show up as exclusive memory of system.slice/systemd-udevd.service. And certainly we don't want a universally shared page to be subjected to the local resource pressure of one lowpri user of it.
> > Recognizing the shared property and propagating it to the common domain - the level at which priorities are equal between them - would make the accounting clearer and solve both these inversions.
> > > 2) It's concerning in security terms. If I understand the idea correctly, a read-only access will allow charges to be moved to an upper level, potentially crossing memory.max limits. It doesn't sound safe.
> > Hm. The mechanism is slightly different, but escaping memory.max happens today as well: shared memory is already not subject to the memory.max of (n-1)/n cgroups that touch it.
> > So before, you can escape containment to whatever other cgroup is using the page. After, you can escape to the common domain. It's difficult for me to say one is clearly worse than the other. You can conceive of realistic scenarios where both are equally problematic.
> > Practically, they appear to require the same solution: if the environment isn't to be trusted, namespacing and limiting access to shared data is necessary to avoid cgroups escaping containment or DoSing other groups.
> > > 3) It brings a non-trivial amount of memory to non-leaf cgroups. To some extent it returns us to the cgroup v1 world and a question of competition between resources consumed by a cgroup directly and through child cgroups. Not that the problem doesn't exist now, but it's less pronounced. If, say, >50% of system.slice's memory will belong to system.slice directly, then we will likely need separate non-recursive counters, limits, protections, etc.
> > I actually do see numbers like this in practice. Temporary system.slice units allocate cache, then their cgroups get deleted and the cache is reused by the next instances. Quite often, system.slice has much more memory than its subgroups combined.
> > So in a way, we already have what I'm proposing if the sharing happens with dead cgroups. Sharing with live cgroups wouldn't necessarily create a bigger demand for new counters than what we have now.
> > I think the cgroup1 issue was slightly different: in cgroup1 we allowed *tasks* to live in non-leaf groups, and so users wanted to control the *private* memory of said tasks with policies that were *different* from the shared policies applied to the leaves.
> > This wouldn't be the same here. Tasks are still only inside leaves, and there is no "private" memory inside a non-leaf group. It's shared among the children, and so subject to policies shared by all children.
> > > 4) Imagine a production server and a system administrator logging in via ssh (and being put into user.slice) and running a big grep... It screws up all memory accounting until the next reboot. Not a completely impossible scenario.
> > This can also happen with the first-touch model, though. The second you touch private data of some workload, the memory might escape it.
> > It's not as pronounced with a first-touch policy - although proactive reclaim makes this worse. But I'm not sure you can call it a new concern in the proposed model: you already have to be careful with the data you touch and bring into memory from your current cgroup.
> > Again, I think this is where mount namespaces come in. You're not necessarily supposed to see private data of workloads from the outside and access it accidentally. It's common practice to ssh directly into containers to muck with them and their memory, at which point you'll be in the appropriate cgroup and permission context, too.
> > However, I do agree with Mina and you: this is a significant change in behavior, and a cgroupfs mount option would certainly be warranted.
> I don't mean to be a nag here, but I have trouble seeing pages being re-accounted on minor faults working for us, and that might be fine, but I'm expecting that if it doesn't really work for us it likely won't work for the next person trying to use this.

Yes, I agree, the performance impact might be non-trivial. I think we discussed something similar in the past in the context of re-charging pages belonging to a deleted cgroup. And the consensus was that we'd need to add hooks into many places to check whether a page belongs to a dying (or other-than-current) cgroup, and that it might not be cheap.

> The issue is that the fact that the memory is initially accounted to the allocating process forces the sysadmin to overprovision the cgroup limit anyway, so that the tasks don't oom if they are pre-allocating memory. The memory usage of a task accessing shared memory stays very unpredictable, because it's waiting on another task in another cgroup to touch the shared memory for the shared memory to be unaccounted from its cgroup.
> I have a couple of (admittedly probably controversial) suggestions:
> 1. A memcg flag, say memory.charge_for_shared_memory. When we allocate shared memory, we charge it to the first ancestor memcg that has memory.charge_for_shared_memory==true.

I think the problem here is that we try really hard to avoid any per-memory-type knobs, and this is another one.

> 2. Somehow, on the creation of shared memory, we declare that this memory belongs to <cgroup>. Only descendants of <cgroup> are able to touch the shared memory, and the shared memory is charged to <cgroup>.

This sounds like a mount namespace.

Thanks!
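For reference, Mina's first suggestion above boils down to a simple ancestor walk at allocation time. The sketch below is a userspace model only; the memory.charge_for_shared_memory flag is hypothetical and does not exist in the kernel, and the pod/container hierarchy is made up for the example.

/*
 * Model of "charge shared memory to the first ancestor that opted in
 * via a (hypothetical) memory.charge_for_shared_memory flag".
 */
#include <stdbool.h>
#include <stdio.h>

struct cg {
    const char *name;
    struct cg *parent;
    bool charge_for_shared_memory;  /* the proposed opt-in flag */
};

/* Pick the memcg a new shared allocation would be charged to. */
static struct cg *shared_charge_target(struct cg *allocator)
{
    for (struct cg *c = allocator; c; c = c->parent)
        if (c->charge_for_shared_memory)
            return c;               /* first opted-in ancestor */
    return allocator;               /* fall back to first-touch behavior */
}

int main(void)
{
    struct cg root = { "root", NULL, false };
    struct cg pod  = { "pod", &root, true };        /* the pod opts in */
    struct cg ctr  = { "container", &pod, false };

    /* A container allocates tmpfs memory meant to be shared across the pod. */
    printf("charged to: %s\n", shared_charge_target(&ctr)->name);
    return 0;
}

The walk itself is trivial; as noted above, the open question is rather whether another per-memory-type knob is acceptable.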