At Linuxcon last year, based on our presentation "vhost: sharing is better" [1], we briefly discussed the idea of cgroup aware workqueues with Tejun. The following patches are a result of that discussion. They are in no way complete - the changes are for unbounded workqueues only - but I wanted to present my unfinished work as an RFC and get some feedback.

1/4 and 3/4 are simple cgroup changes and add a helper function.
2/4 is the main implementation.
4/4 changes vhost to use workqueues with support for cgroups.

Accounting:
When servicing a userspace task A attached to cgroup X, for cgroup awareness, a worker thread could attach to all cgroups of the task it is servicing. This patch does it for unbound workqueues, which means all tasks that are bound to a certain combination of cgroups could potentially be serviced by the same worker thread. The same technique could be applied to bounded workqueues as well.

Example:
vhost creates a worker thread when invoked for a kvm guest. Since the guest is a normal process, the kernel thread servicing it should be attached to the vm process' cgroups.

Design:
The fundamental addition is a cgroup aware worker pool and, as stated above, for the unbounded case only. These changes don't populate the "numa awareness" fields/attrs, and unlike unbounded numa worker pools, cgroup worker pools are created on demand. Every work request could potentially have a new cgroup aware pool created for it based on the combination of cgroups it's attached to. However, workqueues themselves are incognizant of the actual cgroups - they rely on helper functions provided by cgroups either to 1. match all the cgroups of two tasks or 2. attach a worker thread to all cgroups of a userspace task. We maintain a list of cgroup aware pools so that when a new request comes in and a suitable worker pool needs to be found, we search the list first before creating a new one. A worker pool also stores a list of all "task owners" - the processes it is currently serving. (A rough sketch of this lookup-and-attach flow follows the Todo section below.)

Testing:
Created some qemu processes and attached them to different cgroups. Verified that new worker pools are created for tasks that are attached to different cgroups (and reused for the ones that belong to the same). Some simple performance testing using netperf is below. These numbers shouldn't really depend on these patches, though: the cgroup attach and match functions are not in hot paths for the kind of general usage this test exercises.

Netperf:
Two guests running netperf in parallel.

                               Without patches   With patches
TCP_STREAM (10^6 bits/second)  975.45            978.88
TCP_RR (Trans/second)          20121             18820.82
UDP_STREAM (10^6 bits/second)  1287.82           1184.5
UDP_RR (Trans/second)          20766.72          19667.08
Time for a 4G iso download     2m 33 seconds     3m 02 seconds

Todo:
- What about bounded workqueues?
- What happens when the cgroups of a running process change?
- sysfs variables
- Sanity check the flush and destroy paths.
- More extensive testing
- Can we optimize the search/match/attach functions?
- Better performance numbers? (although the ones above don't look bad)
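To make the design above concrete, here is a rough sketch of the intended flow. This is illustrative only, not the patch code itself: cgroup_match_task(), cgroup_pools, pool_first_owner(), pool_add_owner() and create_cgroup_worker_pool() are hypothetical stand-ins for what 1/4 and 2/4 actually add, while task_css_set() and cgroup_attach_task_all() are existing kernel interfaces.

/* 1/4: two tasks are attached to the same combination of cgroups
 * iff they point at the same css_set. */
static bool cgroup_match_task(struct task_struct *t1,
			      struct task_struct *t2)
{
	bool match;

	rcu_read_lock();
	match = task_css_set(t1) == task_css_set(t2);
	rcu_read_unlock();

	return match;
}

/* 2/4: find (or create on demand) an unbound worker pool for the
 * task queueing the work. */
static struct worker_pool *get_cgroup_pool(struct task_struct *owner)
{
	struct worker_pool *pool;

	/* Reuse a pool that already serves this cgroup combination... */
	list_for_each_entry(pool, &cgroup_pools, cg_node) {
		if (cgroup_match_task(pool_first_owner(pool), owner)) {
			pool_add_owner(pool, owner); /* "task owners" list */
			return pool;
		}
	}

	/* ...otherwise create a new cgroup aware pool for it. */
	pool = create_cgroup_worker_pool(owner);
	if (pool)
		pool_add_owner(pool, owner);

	return pool;
}

/* A worker created for such a pool attaches itself to all of the
 * owner's cgroups before processing any work items. */
static int worker_attach_to_cgroups(struct worker *worker,
				    struct task_struct *owner)
{
	return cgroup_attach_task_all(owner, worker->task);
}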
[1] http://events.linuxfoundation.org/sites/events/files/slides/kvm_forum_2015_vhost_sharing_is_better.pdf

Bandan Das (4):
  cgroup: Introduce a function to compare two tasks
  workqueue: introduce support for attaching to cgroups
  cgroup: use spin_lock_irq for cgroup match and attach fns
  vhost: use workqueues for the works

 drivers/vhost/vhost.c       | 103 ++++++++++++++++++---
 drivers/vhost/vhost.h       |   2 +
 include/linux/cgroup.h      |   1 +
 include/linux/workqueue.h   |   2 +
 kernel/cgroup.c             |  40 ++++++++-
 kernel/workqueue.c          | 212 +++++++++++++++++++++++++++++++++++++++++---
 kernel/workqueue_internal.h |   4 +
 7 files changed, 335 insertions(+), 29 deletions(-)

--
2.5.0