On Fri, 2024-05-03 at 10:52 +0200, Peter Zijlstra wrote: > On Thu, May 02, 2024 at 09:20:15AM -1000, Tejun Heo wrote: > > Hello, Peter. > > > > On Thu, May 02, 2024 at 10:48:00AM +0200, Peter Zijlstra wrote: > > > Can you please put your efforts and the touted Google > > > collaboration in > > > fixing the existing cgroup mess? > > > > I suppose you're referring to Rik's flattened hierarchy patchset. > > > > > > https://lore.kernel.org/all/20190822021740.15554-1-riel@xxxxxxxxxxx > > > > You guys Google/Facebook got us the cgroup thing, Google did a lot of > the work for cpu-cgroup, and now you Facebook say you can't live with > it > because it's too expensive. Yes Rik did put a lot of effort into it, > but > Google shot it down. What am I to do? I believe the issues that Paul pointed out with my flattened cgroup code are fixable. I ended up not getting back to this code because it took me a few months to think of ways to fix the issues Paul found, and by then I had moved on to other projects. For reference, Paul found these two (very real) issues with my implementation. 1) Thundering herd problem. If many tasks in a low priority cgroup wake up at the same time, they can end up swamping a CPU. I believe this can be solved with the same idea I had for reimplementing CONFIG_CFS_BANDWIDTH. Specifically, the code that determines the time slice length for a task already has a way to determine whether a CPU is "overloaded", and time slices need to be shortened. Once we reach that situation, we can place woken up tasks on a secondary heap of per-cgroup runqueues, from which we do not directly run tasks, but pick the lowest vruntime task from the lowest vruntime cgroup and put that on the main runqueue, if the previously running task has a vruntime that is higher than that of a task in the secondary group. If a task is woken up in a cgroup that already has tasks on that secondary queue, we wake up the task onto that secondary queue. This means on overloaded CPUs, we move back to a task selection mechanism closer to what we currently have, while in the non-overloaded situation we use a flat runqueue. This same scheme could be used to implement CFS bandwidth control. A task belonging to a throttled group would be placed on the group's queue, not the CPU's flat runqueue. 2) The vruntime for a task can be advanced by way to much at once. If we have tasks A & B running, and task B has a priority that is 1/100th of that of task A, its vruntime would be advanced 100x as much as task A, when running the same length time slice. This creates a big issue if we get a wakeup of task C, at the same priority as task B, and then task A goes to sleep. Due to the very far advanced runtime of task B, task C could get to monopolize the CPU for a considerable amount of time, and task B could get starved. A potential fix for this is to never account more than the maximum time slice length at a time, while any excess delta_exec time for the task gets remembered. At pick_next_entity time, the scheduler can see that task B has a lot of delta_exec time left, and account up to the maximum slice length to the task's vruntime, and place it back in the queue if the next task now has a lower vruntime. For a steady state of a high priority task A and a low priority task B, this makes pick_next_task more expensive, but when task A disappears and task C appears, CPU time will continue to be fair between them. Limiting the total weight of tasks on the flat runqueue, using the mechanism for thundering herd and CFS bandwidth outlined above, should keep this overhead bounded to something reasonable. Does the above sound like it would work? Does it sound like code that you would be ok with merging? Is it a large enough improvement over the current hierarchical runqueue that it would be worth doing? This would be a fairly large project, so we should probably discuss some of the details before investing too much time in it. -- All Rights Reversed.