Here's phase two of padata multithreaded jobs, which multithreads VFIO page
pinning and lays the groundwork for other padata users.  It's RFC because
there are still pieces missing and testing to do, and because of the last
two patches, which I'm hoping scheduler and cgroup folks can weigh in on.
Any and all feedback is welcome.

---

Assigning a VFIO device to a guest requires pinning each and every page of
the guest's memory, which gets expensive for large guests even if the
memory has already been faulted in and cleared with something like qemu
prealloc.

Some recent optimizations[0][1] have brought the cost down, but it's still
a significant bottleneck for guest initialization time.  Parallelize with
padata to take proper advantage of memory bandwidth, yielding up to 12x
speedups for VFIO page pinning and 10x speedups for overall qemu guest
initialization.  Detailed performance results are in patch 8.

Phase one[4] of multithreaded jobs made deferred struct page init use all
the CPUs on x86.  That's a special case because it happens during boot when
the machine is waiting on page init to finish and there are generally no
resource controls to violate.

Page pinning, on the other hand, can be done by a user task (the "main
thread" in a job), so helper threads should honor the main thread's
resource controls that are relevant for pinning (CPU, memory) and give
priority to other tasks on the system.  This RFC has some but not all of
the pieces to do that.

After this phase, it shouldn't take many lines to parallelize other
memory-proportional paths like struct page init for memory hotplug,
munmap(), hugetlb_fallocate(), and __ib_umem_release().

The first half of this series (more or less) has been running in our
kernels for about three years.

Changelog
---------

This addresses some comments on two earlier projects, ktask[2] and
cgroup-aware workqueues[3].
 - Fix undoing a partially completed chunk in the thread function, and use
   a larger minimum chunk size (Alex Williamson)

 - Helper threads should honor the main thread's settings and resource
   controls, and shouldn't disturb other tasks (Michal Hocko, Pavel Machek)

 - Design comments, lockdep awareness (Peter Zijlstra, Jason Gunthorpe)

 - Implement remote charging in the CPU controller (Tejun Heo)

Series Rundown
--------------

     1  padata: Remove __init from multithreading functions
     2  padata: Return first error from a job
     3  padata: Add undo support
     4  padata: Detect deadlocks between main and helper threads

    Get ready to parallelize.  In particular, pinning can fail, so make
    jobs undo-able.

     5  vfio/type1: Pass mm to vfio_pin_pages_remote()
     6  vfio/type1: Refactor dma map removal
     7  vfio/type1: Parallelize vfio_pin_map_dma()
     8  vfio/type1: Cache locked_vm to ease mmap_lock contention

    Do the parallelization itself.

     9  padata: Use kthreads in do_multithreaded
    10  padata: Helpers should respect main thread's CPU affinity
    11  padata: Cap helpers started to online CPUs
    12  sched, padata: Bound max threads with max_cfs_bandwidth_cpus()

    Put caps on the number of helpers started according to the main
    thread's CPU affinity, the system's online CPU count, and the main
    thread's CFS bandwidth settings.

    13  padata: Run helper threads at MAX_NICE
    14  padata: Nice helper threads one by one to prevent starvation

    Prevent helpers from taking CPU away unfairly from other tasks for the
    sake of an optimized kernel code path.

    15  sched/fair: Account kthread runtime debt for CFS bandwidth
    16  sched/fair: Consider kthread debt in cputime

    A prototype for remote charging in CFS bandwidth and cpu.stat,
    described more in the next section.  It's debatable whether these last
    two are required for this series.  Patch 12 caps the number of helper
    threads started according to the max effective CPUs allowed by the
    quota and period of the main thread's task group.
    In practice, I think this hits the sweet spot between complexity and
    respecting CFS bandwidth limits so that patch 15 might just be dropped.
    For instance, when running qemu with a vfio device, the restriction
    from patch 12 was enough to avoid the helpers breaching CFS bandwidth
    limits.  That leaves patch 16, which on its own seems overkill for all
    the hunks it would require from patch 15, so it could be dropped too.

    Patch 12 isn't airtight, though, since other tasks running in the task
    group alongside the main thread and helpers could still result in
    overage.  So, patches 15-16 give an idea of what absolutely correct
    accounting in the CPU controller might look like in case there are
    real situations that want it.

Remote Charging in the CPU Controller
-------------------------------------

CPU-intensive kthreads aren't generally accounted in the CPU controller, so
they escape settings such as weight and bandwidth when they do work on
behalf of a task group.

This problem arises with multithreaded jobs, but is also an issue in other
places.  CPU activity from async memory reclaim (kswapd, cswapd?[5]) should
be accounted to the cgroup that the memory belongs to, and similarly CPU
activity from net rx should be accounted to the task groups that correspond
to the packets being received.  There are also vague complaints from
Android[6].

Each use case has its own requirements[7].  In padata and reclaim, the task
group to account to is known ahead of time, but net rx has to spend cycles
processing a packet before its destination task group is known, so any
solution should be able to work without knowing the task group in advance.
Furthermore, the CPU controller shouldn't throttle reclaim or net rx in
real time since both are doing high priority work.  These make approaches
that run kthreads directly in a task group, like cgroup-aware workqueues[8]
or a kernel path for CLONE_INTO_CGROUP, infeasible.
Running kthreads directly in cgroups also has a downside for padata because
helpers' MAX_NICE priority is "shadowed" by the priority of the group
entities they're running under.

The proposed solution of remote charging can accrue debt to a task group to
be paid off or forgiven later, addressing all these issues.  A kthread
calls the interface

    void cpu_cgroup_remote_begin(struct task_struct *p,
                                 struct cgroup_subsys_state *css);

to begin remote charging to @css, causing @p's current sum_exec_runtime to
be updated and saved.  The @css arg isn't required and can be removed later
to facilitate the unknown cgroup case mentioned above.  Then the kthread
calls another interface

    void cpu_cgroup_remote_charge(struct task_struct *p,
                                  struct cgroup_subsys_state *css);

to account the sum_exec_runtime that @p has used since the first call.
Internally, a new field cfs_bandwidth::debt is added to keep track of
unpaid debt that's only used when the debt exceeds the quota in the current
period.

Weight-based control isn't implemented for now since padata helpers run at
MAX_NICE and so always yield to anything higher priority, meaning they
would rarely compete with other task groups.

[ We have another use case for remote charging: implementing CFS bandwidth
  control across multiple machines.  This is an entirely different topic
  that deserves its own thread. ]

TODO
----

 - Honor these other resource controls:
    - Memory controller limits for helpers via active_memcg.  I *think*
      this will turn out to be necessary despite helpers using the main
      thread's mm, but I need to look into it more.
    - cpuset.mems
    - NUMA memory policy

 - Make helpers aware of signals sent to the main thread

 - Test test test

Series based on 5.14.  I had to downgrade from 5.15 because of an Intel
IOMMU bug that's since been fixed.
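
As a rough illustration of the remote charging scheme described in the
Remote Charging section, here is a small userspace model of the begin/charge
flow and the debt field.  It is only a sketch under stated assumptions, not
the code in patches 15-16: the struct and function names (task_model,
cfs_bandwidth_model, remote_begin, remote_charge, period_refill) are
hypothetical stand-ins, the rule that a charge advances the saved snapshot
is assumed, and the refill policy (pay debt out of the next period's quota)
is one guess at what "paid off later" could mean.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the bits of task_struct used here. */
struct task_model {
	uint64_t sum_exec_runtime;	/* total ns this thread has run */
	uint64_t remote_start;		/* snapshot saved by remote_begin() */
};

/* Hypothetical stand-in for cfs_bandwidth, with the new debt field. */
struct cfs_bandwidth_model {
	uint64_t quota;		/* ns allowed per period */
	uint64_t runtime;	/* ns still available in this period */
	uint64_t debt;		/* ns charged that the period couldn't cover */
};

/* Models cpu_cgroup_remote_begin(): save the current runtime. */
void remote_begin(struct task_model *p)
{
	p->remote_start = p->sum_exec_runtime;
}

/*
 * Models cpu_cgroup_remote_charge(): account the runtime used since the
 * snapshot.  Runtime left in the period absorbs the charge first; only
 * the excess is recorded as debt.  Advancing the snapshot so a later
 * charge doesn't double count is an assumption of this sketch.
 */
void remote_charge(struct task_model *p, struct cfs_bandwidth_model *b)
{
	uint64_t delta;

	assert(p->sum_exec_runtime >= p->remote_start);
	delta = p->sum_exec_runtime - p->remote_start;
	p->remote_start = p->sum_exec_runtime;

	if (delta <= b->runtime) {
		b->runtime -= delta;
	} else {
		b->debt += delta - b->runtime;
		b->runtime = 0;
	}
}

/* Assumed policy: a new period pays down debt before granting runtime. */
void period_refill(struct cfs_bandwidth_model *b)
{
	uint64_t pay = b->debt < b->quota ? b->debt : b->quota;

	b->debt -= pay;
	b->runtime = b->quota - pay;
}
```

In this model the helper is never throttled while it runs; the group only
sees less runtime in later periods, which matches the requirement above
that reclaim and net rx not be throttled in real time.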

thanks,
Daniel

[0] https://lore.kernel.org/linux-mm/20210128182632.24562-1-joao.m.martins@xxxxxxxxxx
[1] https://lore.kernel.org/lkml/20210219161305.36522-1-daniel.m.jordan@xxxxxxxxxx/
[2] https://x-lore.kernel.org/all/20181105165558.11698-1-daniel.m.jordan@xxxxxxxxxx/
[3] https://lore.kernel.org/linux-mm/20190605133650.28545-1-daniel.m.jordan@xxxxxxxxxx/
[4] https://x-lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@xxxxxxxxxx/
[5] https://x-lore.kernel.org/all/20200219181219.54356-1-hannes@xxxxxxxxxxx/
[6] https://x-lore.kernel.org/all/20210407013856.GC21941@xxxxxxxxxxxxxx/
[7] https://x-lore.kernel.org/all/20200219214112.4kt573kyzbvmbvn3@xxxxxxxxxxxxxxxxxxxxxxxxxx/
[8] https://x-lore.kernel.org/all/20190605133650.28545-1-daniel.m.jordan@xxxxxxxxxx/

Daniel Jordan (16):
  padata: Remove __init from multithreading functions
  padata: Return first error from a job
  padata: Add undo support
  padata: Detect deadlocks between main and helper threads
  vfio/type1: Pass mm to vfio_pin_pages_remote()
  vfio/type1: Refactor dma map removal
  vfio/type1: Parallelize vfio_pin_map_dma()
  vfio/type1: Cache locked_vm to ease mmap_lock contention
  padata: Use kthreads in do_multithreaded
  padata: Helpers should respect main thread's CPU affinity
  padata: Cap helpers started to online CPUs
  sched, padata: Bound max threads with max_cfs_bandwidth_cpus()
  padata: Run helper threads at MAX_NICE
  padata: Nice helper threads one by one to prevent starvation
  sched/fair: Account kthread runtime debt for CFS bandwidth
  sched/fair: Consider kthread debt in cputime

 drivers/vfio/Kconfig            |   1 +
 drivers/vfio/vfio_iommu_type1.c | 170 ++++++++++++++---
 include/linux/padata.h          |  31 +++-
 include/linux/sched.h           |   2 +
 include/linux/sched/cgroup.h    |  37 ++++
 kernel/padata.c                 | 311 +++++++++++++++++++++++++-------
 kernel/sched/core.c             |  58 ++++++
 kernel/sched/fair.c             |  99 +++++++++-
 kernel/sched/sched.h            |   5 +
 mm/page_alloc.c                 |   4 +-
 10 files changed, 620 insertions(+), 98 deletions(-)
 create mode 100644 include/linux/sched/cgroup.h


base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f
-- 
2.34.1