Hello, Rik.

On Mon, May 06, 2024 at 02:47:47PM -0400, Rik van Riel wrote:
> I believe the issues that Paul pointed out with my flattened cgroup code
> are fixable. I ended up not getting back to this code because it took me a
> few months to think of ways to fix the issues Paul found, and by then I
> had moved on to other projects.
>
> For reference, Paul found these two (very real) issues with my
> implementation.
>
> 1) Thundering herd problem. If many tasks in a low priority cgroup wake
> up at the same time, they can end up swamping a CPU.

The way that scx_flatcg (which seems broken right now, will fix it) works
around this problem is by always doing two-level scheduling, i.e. the top
level rbtree hosts the tasks in the root cgroup and all the active cgroups,
where each cgroup is scheduled according to its current flattened
hierarchical share. This seems to work pretty well, as what becomes really
expensive is the repeated nesting, which can easily go 6+ levels deep.

This doesn't solve the thundering herd problem completely but shifts it up
one level: thundering herds of threads can be handled easily; however,
thundering herds of cgroups can still cause unfairness. I can imagine cases
where this could lead to scheduling issues, but they all seem pretty
convoluted and artificial. Sure, somebody who's intentionally adversarial
can cause temporary issues, but perfect isolation of adversarial actors
isn't something cgroups can or should practically target.

Even if this turns out to be an actual issue, we can solve it with limited
nesting. cgroups already have delegation boundaries. Maybe they need to be
made more explicit, but one solution could be adding a nesting layer only
on delegation boundaries so that misbehaviors are better contained within
each delegation domain.

> I believe this can be solved with the same idea I had for
> reimplementing CONFIG_CFS_BANDWIDTH.
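For reference, the flattened-share idea above can be sketched roughly like
this: a cgroup's share on the single top-level rbtree is its weight
fraction multiplied down from the root. The struct layout, field names, and
fixed-point scale here are illustrative only, not scx_flatcg's actual code,
and maintenance of the active-children weight sums is assumed to happen
elsewhere:

```c
#include <assert.h>

/*
 * Illustrative sketch of computing a "flattened hierarchical share":
 * walk from the cgroup up to the root, multiplying in this level's
 * fraction of its parent's active children at each step. All names
 * and the fixed-point scale are hypothetical.
 */
struct cgrp {
	const struct cgrp *parent;       /* NULL for the root cgroup */
	unsigned long weight;            /* this cgroup's weight */
	unsigned long child_weight_sum;  /* sum of active children's weights,
					    maintained on enable/disable */
};

#define HWEIGHT_ONE (1UL << 16)          /* fixed-point 1.0 */

static unsigned long flattened_hweight(const struct cgrp *cg)
{
	unsigned long h = HWEIGHT_ONE;

	for (; cg->parent; cg = cg->parent)
		h = h * cg->weight / cg->parent->child_weight_sum;
	return h;
}
```

With this, only the single top-level tree needs to be searched at dispatch
time; the cost of the hierarchy is paid once, when a cgroup's flattened
weight is (re)computed, rather than on every pick.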
> Specifically, the code that
> determines the time slice length for a task already has a way to
> determine whether a CPU is "overloaded", and time slices need to be
> shortened. Once we reach that situation, we can place woken up tasks on
> a secondary heap of per-cgroup runqueues, from which we do not directly
> run tasks, but pick the lowest vruntime task from the lowest vruntime
> cgroup and put that on the main runqueue, if the previously running

When overloaded, are the cgroups being put on a single rbtree? If so,
they'd be using flattened shares, right? I wonder whether what you're
suggesting for the overloaded case is pretty similar to what flatcg is
doing, plus avoiding one level of indirection while not overloaded.

Thanks.

--
tejun