The following patches apply over linus's tree or mst's vhost branch
and my cleanup patchset:

https://lists.linuxfoundation.org/pipermail/virtualization/2021-May/054354.html

These patches allow us to support multiple vhost workers per device. I
ended up just doing Stefan's original idea where userspace has the
kernel create a worker and we pass back the pid. The benefit over the
workqueue and userspace thread approaches is that we only have one'ish
code path in the kernel during setup to detect old tools. The main IO
paths and device/vq setup/teardown paths all use common code.

The kernel patches here allow us to then do N workers per device and
also share workers across devices.

I've also included a patch for qemu so you can get an idea of how it
works. If we are ok with the kernel code then I'll break that up into
a patchset and send to qemu-devel.

Results:
--------
When running with the null_blk driver and vhost-scsi I can get 1.2
million IOPs by just running a simple

fio --filename=/dev/sda --direct=1 --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=128 --numjobs=8 --time_based \
    --group_reporting --name=iops --runtime=60 --eta-newline=1

The VM has 8 vCPUs and sda has 8 virtqueues and we can do a total of
1024 cmds per device. To get 1.2 million IOPs I did have to tune things
and run the virsh emulatorpin command so the vhost threads were running
on different CPUs than the VM. If the vhost threads share CPUs with the
VM then I get around 800K.

For more realistic devices that are also CPU hogs, like iscsi, I can
still get 1 million IOPs using 1 dm-multipath device over 8 iscsi paths
(natively it gets 1.1 million IOPs).

Results/TODO Note:

- I ported the vdpa sim code to support multiple workers and as-is now
  it made perf much worse. If I increase vdpa_sim_blk's num queues to
  4-8 I get 700K IOPs with the fio command above. However with the
  multiple worker support it drops to 400K. The problem is the vdpa_sim
  lock and the iommu_lock.
  If I hack around them (e.g. comment out the locks and don't worry
  about data corruption or crashes) then I can get around 1.2M - 1.6M
  IOPs with 8 queues and the fio command above.

  So these patches could help other drivers, but it will just take more
  work to remove those types of locks. I was hoping the 2 items could be
  done independently since it helps vhost-scsi immediately.

TODO:
- Stefano has 2 questions about security issues with passing the pid
  back to userspace and whether we should do a feature bit. We are
  waiting to hear back from the list.

v2:
- changed the loop in which we take a refcount to the worker.
- replaced pid == -1 with a define.
- fixed tabbing/spacing coding style issue.
- use a hash instead of a list to look up workers.
- I dropped the patch that added an ioctl cmd to get a vq's worker's
  pid. I saw we might do a generic netlink interface instead.
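
To illustrate the idea of looking up shared workers by pid in a hash
and taking a refcount on them, here is a minimal user-space sketch. It
is not the code from the patches: every name in it (struct worker,
worker_create, worker_get, worker_put, WORKER_NO_PID) is hypothetical,
the hash is a plain mask over the pid, and there is no locking, which
real kernel code would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define WORKER_HASH_BITS 4
#define WORKER_HASH_SIZE (1u << WORKER_HASH_BITS)
#define WORKER_NO_PID    (-1)	/* hypothetical name for the "no worker" value */

/* One worker, possibly shared by several devices/vqs. */
struct worker {
	int pid;
	unsigned int refcount;
	struct worker *next;	/* chain within one hash bucket */
};

struct worker_table {
	struct worker *buckets[WORKER_HASH_SIZE];
};

static unsigned int worker_hash(int pid)
{
	return (unsigned int)pid & (WORKER_HASH_SIZE - 1);
}

/* Create a worker for @pid; the caller holds the initial reference. */
static struct worker *worker_create(struct worker_table *t, int pid)
{
	struct worker *w;
	unsigned int b;

	if (pid == WORKER_NO_PID)
		return NULL;
	w = calloc(1, sizeof(*w));
	if (!w)
		return NULL;
	w->pid = pid;
	w->refcount = 1;
	b = worker_hash(pid);
	w->next = t->buckets[b];
	t->buckets[b] = w;
	return w;
}

/* Look up an existing worker by pid and take a reference on it. */
static struct worker *worker_get(struct worker_table *t, int pid)
{
	struct worker *w;

	if (pid == WORKER_NO_PID)
		return NULL;
	for (w = t->buckets[worker_hash(pid)]; w; w = w->next) {
		if (w->pid == pid) {
			w->refcount++;
			return w;
		}
	}
	return NULL;
}

/* Drop a reference; unlink and free on the last one. Returns 1 if freed. */
static int worker_put(struct worker_table *t, struct worker *w)
{
	struct worker **pp;

	assert(w->refcount > 0);
	if (--w->refcount)
		return 0;
	for (pp = &t->buckets[worker_hash(w->pid)]; *pp; pp = &(*pp)->next) {
		if (*pp == w) {
			*pp = w->next;
			break;
		}
	}
	free(w);
	return 1;
}
```

In this sketch a second device that wants to share a worker would call
worker_get() with the pid it was given, and the worker is only freed
once every user has called worker_put().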