Hello,

On Tue, Jan 11, 2022 at 11:29:50AM -0500, Daniel Jordan wrote:
...
> This problem arises with multithreaded jobs, but is also an issue in other
> places. CPU activity from async memory reclaim (kswapd, cswapd?[5]) should be
> accounted to the cgroup that the memory belongs to, and similarly CPU activity
> from net rx should be accounted to the task groups that correspond to the
> packets being received. There are also vague complaints from Android[6].

These are pretty big holes in CPU cycle accounting right now and I think
spend-first-and-backcharge is the right solution for most of them given
experiences from other controllers. That said,

> Each use case has its own requirements[7]. In padata and reclaim, the task
> group to account to is known ahead of time, but net rx has to spend cycles
> processing a packet before its destination task group is known, so any solution
> should be able to work without knowing the task group in advance. Furthermore,
> the CPU controller shouldn't throttle reclaim or net rx in real time since both
> are doing high priority work. These make approaches that run kthreads directly
> in a task group, like cgroup-aware workqueues[8] or a kernel path for
> CLONE_INTO_CGROUP, infeasible. Running kthreads directly in cgroups also has a
> downside for padata because helpers' MAX_NICE priority is "shadowed" by the
> priority of the group entities they're running under.
>
> The proposed solution of remote charging can accrue debt to a task group to be
> paid off or forgiven later, addressing all these issues. A kthread calls the
> interface
>
>     void cpu_cgroup_remote_begin(struct task_struct *p,
>                                  struct cgroup_subsys_state *css);
>
> to begin remote charging to @css, causing @p's current sum_exec_runtime to be
> updated and saved. The @css arg isn't required and can be removed later to
> facilitate the unknown cgroup case mentioned above.
> Then the kthread calls another interface
>
>     void cpu_cgroup_remote_charge(struct task_struct *p,
>                                   struct cgroup_subsys_state *css);
>
> to account the sum_exec_runtime that @p has used since the first call.
> Internally, a new field cfs_bandwidth::debt is added to keep track of unpaid
> debt that's only used when the debt exceeds the quota in the current period.
>
> Weight-based control isn't implemented for now since padata helpers run at
> MAX_NICE and so always yield to anything higher priority, meaning they would
> rarely compete with other task groups.

If we're gonna do this, let's please do it right and make weight based control
work too. Otherwise, its usefulness is pretty limited.

Thanks.

-- 
tejun
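The remote charging scheme quoted above can be modeled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel implementation: task_model, group_model, remote_begin(), remote_charge() and new_period() are hypothetical stand-ins for a task's sum_exec_runtime, cfs_bandwidth with its proposed debt field, and the two cpu_cgroup_remote_*() interfaces. Two behaviors are assumptions not stated in the proposal: remote_charge() re-snapshots the runtime so a repeated charge does not double count, and debt is repaid from the next period's quota (the proposal only says debt can be paid off or forgiven later).

```c
#include <stdint.h>

/* Hypothetical model of a task; not the kernel's task_struct. */
struct task_model {
	uint64_t sum_exec_runtime; /* ns of CPU time consumed so far */
	uint64_t remote_start;     /* snapshot taken by remote_begin() */
};

/* Hypothetical model of a group's bandwidth state; not cfs_bandwidth. */
struct group_model {
	uint64_t quota;   /* ns allowed per period */
	uint64_t runtime; /* ns consumed in the current period */
	uint64_t debt;    /* ns charged beyond quota, carried forward */
};

/* Stand-in for cpu_cgroup_remote_begin(): save the current runtime. */
void remote_begin(struct task_model *p)
{
	p->remote_start = p->sum_exec_runtime;
}

/* Stand-in for cpu_cgroup_remote_charge(): charge the runtime used since
 * the snapshot; whatever does not fit in this period's quota becomes debt. */
void remote_charge(struct task_model *p, struct group_model *g)
{
	uint64_t delta = p->sum_exec_runtime - p->remote_start;
	uint64_t room = g->quota > g->runtime ? g->quota - g->runtime : 0;

	if (delta <= room) {
		g->runtime += delta;
	} else {
		g->runtime += room;        /* fill the period up to quota */
		g->debt += delta - room;   /* the overflow is unpaid debt */
	}
	/* Assumption: re-snapshot so a second charge doesn't double count. */
	p->remote_start = p->sum_exec_runtime;
}

/* Assumption: at a period boundary, debt is paid off from the fresh quota
 * before the group's own tasks get to run. */
void new_period(struct group_model *g)
{
	uint64_t pay = g->debt < g->quota ? g->debt : g->quota;

	g->debt -= pay;
	g->runtime = pay;
}
```

For example, with a quota of 100 ns, charging 30 ns and then 150 ns leaves the group at its quota with 80 ns of debt, which new_period() then repays out of the next period's quota. Charging first and settling later is what lets the net rx case work: the cycles are spent before the destination group is known, and only the final accounting needs the group.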