On Tue, Jan 11, 2022 at 12:58:53PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 05, 2022 at 07:46:55PM -0500, Daniel Jordan wrote:
> > As before, helpers in multithreaded jobs don't honor the main thread's
> > CFS bandwidth limits, which could lead to the group exceeding its quota.
> >
> > Fix it by having helpers remote charge their CPU time to the main
> > thread's task group.  A helper calls a pair of new interfaces
> > cpu_cgroup_remote_begin() and cpu_cgroup_remote_charge() (see function
> > header comments) to achieve this.
> >
> > This is just supposed to start a discussion, so it's pretty simple.
> > Once a kthread has finished a remote charging period with
> > cpu_cgroup_remote_charge(), its runtime is subtracted from the target
> > task group's runtime (cfs_bandwidth::runtime) and any remainder is saved
> > as debt (cfs_bandwidth::debt) to pay off in later periods.
> >
> > Remote charging tasks aren't throttled when the group reaches its quota,
> > and a task group doesn't run at all until its debt is completely paid,
> > but these shortcomings can be addressed if the approach ends up being
> > taken.
> >
>
> *groan*... and not a single word on why it wouldn't be much better to
> simply move the task into the relevant cgroup..

Yes, the cover letter talks about that, I'll quote the relevant part
here.

---

    15  sched/fair: Account kthread runtime debt for CFS bandwidth
    16  sched/fair: Consider kthread debt in cputime

A prototype for remote charging in CFS bandwidth and cpu.stat, described
more in the next section.  It's debatable whether these last two are
required for this series.  Patch 12 caps the number of helper threads
started according to the max effective CPUs allowed by the quota and
period of the main thread's task group.  In practice, I think this hits
the sweet spot between complexity and respecting CFS bandwidth limits so
that patch 15 might just be dropped.
For instance, when running qemu with a vfio device, the restriction from
patch 12 was enough to avoid the helpers breaching CFS bandwidth limits.
That leaves patch 16, which on its own seems overkill for all the hunks
it would require from patch 15, so it could be dropped too.

Patch 12 isn't airtight, though, since other tasks running in the task
group alongside the main thread and helpers could still result in
overage.  So, patches 15-16 give an idea of what absolutely correct
accounting in the CPU controller might look like in case there are real
situations that want it.


Remote Charging in the CPU Controller
-------------------------------------

CPU-intensive kthreads aren't generally accounted in the CPU controller,
so they escape settings such as weight and bandwidth when they do work
on behalf of a task group.

This problem arises with multithreaded jobs, but is also an issue in
other places.  CPU activity from async memory reclaim (kswapd,
cswapd?[5]) should be accounted to the cgroup that the memory belongs
to, and similarly CPU activity from net rx should be accounted to the
task groups that correspond to the packets being received.  There are
also vague complaints from Android[6].

Each use case has its own requirements[7].  In padata and reclaim, the
task group to account to is known ahead of time, but net rx has to spend
cycles processing a packet before its destination task group is known,
so any solution should be able to work without knowing the task group in
advance.  Furthermore, the CPU controller shouldn't throttle reclaim or
net rx in real time since both are doing high priority work.  These make
approaches that run kthreads directly in a task group, like cgroup-aware
workqueues[8] or a kernel path for CLONE_INTO_CGROUP, infeasible.
Running kthreads directly in cgroups also has a downside for padata
because helpers' MAX_NICE priority is "shadowed" by the priority of the
group entities they're running under.
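
To make the patch 12 cap mentioned earlier a bit more concrete, here is a
rough userspace sketch of limiting the helper count to the effective CPUs
implied by a quota and period.  It is not the code from the series:
cap_helpers() and QUOTA_UNLIMITED are hypothetical names, the units are
arbitrary, and rounding the quota/period ratio up (rather than down) is
an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "no bandwidth limit"; not a kernel constant. */
#define QUOTA_UNLIMITED UINT64_MAX

/*
 * Cap the number of helper threads at ceil(quota / period), the number
 * of CPUs the group could keep busy for a whole period, without ever
 * exceeding what the job requested.  At least one thread is allowed.
 * Rounding up is an assumption made for this sketch.
 */
static unsigned int cap_helpers(unsigned int requested,
                                uint64_t quota, uint64_t period)
{
    uint64_t max_cpus;

    assert(period > 0);

    if (quota == QUOTA_UNLIMITED)
        return requested;

    /* Integer ceil(quota / period) without overflow. */
    max_cpus = quota / period + (quota % period != 0);
    if (max_cpus == 0)
        max_cpus = 1;

    return requested < max_cpus ? requested : (unsigned int)max_cpus;
}
```

With a quota of 250000 and a period of 100000, a job asking for 16
helpers would be held to 3 under these assumptions.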

The proposed solution of remote charging can accrue debt to a task group
to be paid off or forgiven later, addressing all these issues.  A
kthread calls the interface

    void cpu_cgroup_remote_begin(struct task_struct *p,
                                 struct cgroup_subsys_state *css);

to begin remote charging to @css, causing @p's current sum_exec_runtime
to be updated and saved.  The @css arg isn't required and can be removed
later to facilitate the unknown cgroup case mentioned above.  Then the
kthread calls another interface

    void cpu_cgroup_remote_charge(struct task_struct *p,
                                  struct cgroup_subsys_state *css);

to account the sum_exec_runtime that @p has used since the first call.
Internally, a new field cfs_bandwidth::debt is added to keep track of
unpaid debt that's only used when the debt exceeds the quota in the
current period.

Weight-based control isn't implemented for now since padata helpers run
at MAX_NICE and so always yield to anything higher priority, meaning
they would rarely compete with other task groups.

[ We have another use case to use remote charging for implementing CFS
  bandwidth control across multiple machines.  This is an entirely
  different topic that deserves its own thread. ]
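
For readers who want the begin/charge/debt flow in one place, the
following is a toy userspace model of the accounting described above,
not the kernel implementation.  Only the field names runtime, debt and
sum_exec_runtime come from the description; struct bw_model, struct
helper_model and the model_*() functions are hypothetical, and paying
debt out of each period's refill before the group gets any runtime is an
assumption about how "pay off in later periods" might work.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the group's bandwidth state (hypothetical type). */
struct bw_model {
    uint64_t quota;    /* runtime granted per period */
    uint64_t runtime;  /* runtime left in the current period */
    uint64_t debt;     /* remote-charged time not yet paid off */
};

/* Toy stand-in for a helper kthread (hypothetical type). */
struct helper_model {
    uint64_t sum_exec_runtime;  /* total time the helper has run */
    uint64_t remote_start;      /* snapshot taken at "begin" */
};

/* Start a remote charging period: save the helper's current runtime. */
static void model_remote_begin(struct helper_model *h)
{
    h->remote_start = h->sum_exec_runtime;
}

/*
 * End a remote charging period: subtract the time used since "begin"
 * from the group's remaining runtime and save any remainder as debt.
 */
static void model_remote_charge(struct helper_model *h, struct bw_model *bw)
{
    uint64_t delta;

    assert(h->sum_exec_runtime >= h->remote_start);
    delta = h->sum_exec_runtime - h->remote_start;

    if (delta <= bw->runtime) {
        bw->runtime -= delta;
    } else {
        bw->debt += delta - bw->runtime;
        bw->runtime = 0;
    }
    h->remote_start = h->sum_exec_runtime;
}

/*
 * Period refill (assumed policy): debt is paid from the new quota
 * first, so the group gets no runtime while debt covers the quota.
 */
static void model_refill(struct bw_model *bw)
{
    uint64_t pay = bw->debt < bw->quota ? bw->debt : bw->quota;

    bw->debt -= pay;
    bw->runtime = bw->quota - pay;
}
```

In this model, a group with a quota of 100 whose helper runs for 250
between begin and charge ends the period with runtime 0 and debt 150.
The first refill leaves debt 50 and runtime 0, and the second clears the
debt and leaves runtime 50, which matches the "doesn't run at all until
its debt is completely paid" behavior noted in the patch description.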