On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 18, 2020 at 09:49:00AM -0400, Johannes Weiner wrote:
> > On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@xxxxxxxxxxxxx wrote:
> > > What you need is a feedback loop against the rate of freeing pages, and
> > > when you near the saturation point, the allocation rate should exactly
> > > match the freeing rate.
> >
> > IO throttling solves a slightly different problem.
> >
> > IO occurs in parallel to the workload's execution stream, and you're
> > trying to take the workload from dirtying at CPU speed to rate match
> > to the independent IO stream.
> >
> > With memory allocations, though, freeing happens from inside the
> > execution stream of the workload. If you throttle allocations, you're
>
> For a single task, but even then you're making the argument that we need
> to allocate memory to free memory, and we all know where that gets us.
>
> But we're actually talking about a cgroup here, which is a collection of
> tasks all doing things in parallel.

Right, but sharing a memory cgroup means sharing an LRU list, and that
transfers memory pressure and allocation burden between otherwise
independent tasks - if nothing else through cache misses on the
executables and libraries. I doubt that one task can go through
several comprehensive reclaim cycles on a shared LRU without
completely annihilating the latency or throughput targets of everybody
else in the group in most real world applications.

> > most likely throttling the freeing rate as well. And you'll slow down
> > reclaim scanning by the same amount as the page references, so it's
> > not making reclaim more successful either. The alloc/use/free
> > (im)balance is an inherent property of the workload, regardless of the
> > speed you're executing it at.
>
> Arguably seeing the rate drop to near 0 is a very good point to consider
> running cgroup-OOM.

Agreed.

In the past, that's actually what we did: In cgroup1, you could
disable the kernel OOM killer, and when reclaim failed at the limit,
the allocating task would be put on a waitqueue until woken up by a
freeing event.

Conceptually this is clean & straightforward. However,

1. Putting allocation contexts with unknown locks to indefinite sleep
   caused deadlocks, for obvious reasons. Userspace OOM killing tends
   to take a lot of task-specific locks when scanning through /proc
   files for kill candidates, and can easily get stuck.

   Using bounded over indefinite waits is simply acknowledging that
   the deadlock potential when connecting arbitrary task stacks in the
   system through free->alloc ordering is equally difficult to plan
   out as alloc->free ordering.

   The non-cgroup OOM killer actually has the same deadlock potential,
   where the allocating/killing task can hold resources that the OOM
   victim requires to exit. The OOM reaper hides it, the static
   emergency reserves hide it - but to truly solve this problem, you
   would have to have full knowledge of memory & lock ordering
   dependencies of those tasks. And then you can still end up with
   scenarios where the only answer is panic().

2. I don't recall ever seeing situations in cgroup1 where the precise
   matching of allocation rate to freeing rate has allowed cgroups to
   run sustainably after reclaim has failed. The practical benefit of
   a complicated feedback loop over something crude & robust once
   we're in an OOM situation is not apparent to me.

   [ That's different from the IO-throttling *while still doing
     reclaim* that Dave brought up. *That* justifies the same effort
     we put into dirty throttling. I'm only talking about the
     situation where reclaim has already failed and we need to
     facilitate userspace OOM handling. ]

So that was the motivation for the bounded sleeps. They do not
guarantee containment, but they provide a reasonable amount of time
for the userspace OOM handler to intervene, without deadlocking.

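To make the bounded-vs-indefinite distinction concrete, here is a
rough user-space sketch using pthreads. It is not kernel code and not
what memcg does internally: the condition variable merely stands in
for the waitqueue, and struct group_model, group_note_free() and
group_wait_for_free() are names made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a cgroup's "someone freed memory" waitqueue. */
struct group_model {
	pthread_mutex_t lock;
	pthread_cond_t freed;
	long free_events;	/* freeing events not yet consumed by a waiter */
};

static void group_init(struct group_model *g)
{
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->freed, NULL);
	g->free_events = 0;
}

/* Freeing side: record the event and wake up any waiting allocators. */
static void group_note_free(struct group_model *g)
{
	pthread_mutex_lock(&g->lock);
	g->free_events++;
	pthread_cond_broadcast(&g->freed);
	pthread_mutex_unlock(&g->lock);
}

/*
 * Allocating side: wait for a freeing event, but never longer than
 * timeout_ms. Returns true if a freeing event was consumed, false if
 * the wait gave up. An indefinite variant would use pthread_cond_wait()
 * here and would simply never return if nobody frees anything.
 */
static bool group_wait_for_free(struct group_model *g, long timeout_ms)
{
	struct timespec deadline;
	bool woken;
	int err = 0;

	assert(timeout_ms >= 0);

	/* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&g->lock);
	/* Loop to absorb spurious wakeups; stop on timeout or any error. */
	while (g->free_events == 0 && err == 0)
		err = pthread_cond_timedwait(&g->freed, &g->lock, &deadline);
	woken = g->free_events > 0;
	if (woken)
		g->free_events--;
	pthread_mutex_unlock(&g->lock);

	return woken;
}
```

With nobody freeing, group_wait_for_free(&g, 20) returns false after
roughly 20ms instead of hanging, which is the whole point: the caller
gets control back even if the task that would have done the freeing is
itself stuck behind something the caller holds.
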
That all being said, the semantics of the new 'high' limit in cgroup2
have allowed us to move reclaim/limit enforcement out of the
allocation context and into the userspace return path.

See the call to mem_cgroup_handle_over_high() from
tracehook_notify_resume(), and the comments in try_charge() around
set_notify_resume().

This already solves the free->alloc ordering problem by allowing the
allocation to exceed the limit temporarily until at least all locks
are dropped, we know we can sleep etc., before performing enforcement.

That means we may not need the timed sleeps anymore for that purpose,
and could bring back directed waits for freeing-events again.

What do you think? Any hazards around indefinite sleeps in that resume
path? It's called before __rseq_handle_notify_resume and the
arch-specific resume callback (which appears to be a no-op currently).

Chris, Michal, what are your thoughts? It would certainly be simpler
conceptually on the memcg side.
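
For anyone not familiar with the deferred enforcement described above,
a toy user-space model of the idea follows. It only mirrors the shape
of the mechanism (charge now, flag the task, enforce on the way back
to userspace); struct task_model, model_charge() and
model_resume_to_user() are hypothetical names, and "reclaim" is
assumed to always succeed, which the real code of course cannot
assume.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task view of a cgroup with a 'high' limit. */
struct task_model {
	long charged;		/* pages currently charged */
	long high;		/* the 'high' limit, in pages */
	int locks_held;		/* locks held by the current context */
	bool notify_resume;	/* enforcement pending at return to user */
};

/*
 * Allocation context: may hold arbitrary locks, so it never sleeps.
 * The charge is allowed to exceed 'high'; we only flag the task so
 * that enforcement runs later from a safe context.
 */
static void model_charge(struct task_model *t, long nr_pages)
{
	t->charged += nr_pages;
	if (t->charged > t->high)
		t->notify_resume = true;
}

/*
 * Return-to-userspace path: all locks have been dropped and sleeping
 * is fine, so this is where the overage is dealt with. Returns the
 * number of pages "reclaimed" (assumed to always succeed in this toy).
 */
static long model_resume_to_user(struct task_model *t)
{
	long overage;

	assert(t->locks_held == 0);

	if (!t->notify_resume)
		return 0;
	t->notify_resume = false;

	overage = t->charged - t->high;
	if (overage <= 0)
		return 0;

	t->charged -= overage;
	return overage;
}
```

Charging 150 pages against high=100 while a lock is held leaves
charged at 150 with the flag set; once the lock is dropped,
model_resume_to_user() brings it back to 100. Any waiting, bounded or
not, would sit in that second function, where the task holds nothing
that a freeing task could be blocked on.
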