The following patches apply over Linus's tree and this patchset:

https://lore.kernel.org/all/20211007214448.6282-1-michael.christie@xxxxxxxxxx/

which allows us to check the vhost owner thread's RLIMITs. It looks like
that patchset has been ok'd by all the major parties and just needs a
small cleanup to apply to Jens's and Paul's trees, so I wanted to post my
threading patches based over it for review.

The following patches allow us to support multiple vhost workers per
device. I ended up just doing Stefan's original idea where userspace has
the kernel create a worker and we pass back the pid. This has the benefit
over the workqueue and userspace-thread approaches that we only have
one-ish code path in the kernel during setup to detect old tools. The
main IO paths and the device/vq setup/teardown paths all use common code.
(A rough sketch of the userspace flow is at the end of this mail.)

I've also included a patch for qemu so you can get an idea of how it
works. If we are ok with the kernel code then I'll break that up into a
patchset and send it to qemu-devel for review.

Results:
--------

fio jobs        1       2       4       8       12      16
----------------------------------------------------------
1 worker        84k     492k    510k    -       -       -
worker per vq   184k    380k    744k    1422k   2256k   2434k

Notes:

0. This used a simple fio command:

   fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
       --ioengine=libaio --iodepth=128 --numjobs=$JOBS_ABOVE

   and I used a VM with 16 vCPUs and 16 virtqueues.

1. The patches were tested with emulate_pr=0 and these patches:

   https://lore.kernel.org/all/yq1tuhge4bg.fsf@xxxxxxxxxxxxxxxxxxxx/t/

   which are in mkp's scsi branches for the next kernel. They fix the
   perf issues where IOPs dropped at 12 vqs/jobs.

2. Because we have a hard limit of 1024 cmds, if the num jobs * iodepth
   was greater than 1024, I would decrease the iodepth. So 12 jobs used
   an iodepth of 85, and 16 jobs used 64.

3. The perf issue above at 2 jobs is because when we only have 1 worker
   we execute more cmds per vhost_work due to all vqs funneling to one
   worker. This results in fewer context switches and better batching
   without having to tweak any settings. I'm working on patches to add
   back batching during LIO completion and to do polling on the
   submission side.

   We will still want the threading patches, because if we batch at the
   fio level plus use the vhost threading patches, we can see a big
   boost like below. So hopefully doing it in the kernel will allow apps
   to just work without having to be smart like fio.

   fio using io_uring and batching with the iodepth_batch* settings:

   fio jobs        1       2       4       8       12      16
   -------------------------------------------------------------
   1 worker        494k    520k    -       -       -       -
   worker per vq   496k    878k    1542k   2436k   2304k   2590k

V3:
- Fully convert the vhost code to use vq-based APIs instead of leaving it
  half per-dev and half per-vq.
- Rebase against the kernel worker API.
- Drop delayed worker creation. We always create the default worker at
  VHOST_SET_OWNER time. Userspace can create and bind workers after that.

v2:
- Change the loop in which we take a refcount on the worker.
- Replace pid == -1 with a define.
- Fix tabbing/spacing coding style issues.
- Use a hash instead of a list to look up workers.
- Drop the patch that added an ioctl cmd to get a vq's worker's pid. I
  saw we might do a generic netlink interface instead.
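
For reference, here is a minimal userspace sketch of the flow described
above, where the kernel creates a worker thread and passes back its pid,
and each virtqueue gets its own worker. The ioctl name, struct layout,
ioctl number and "new worker" sentinel below are illustrative assumptions
only and are not taken from the patches; see the actual patchset for the
real interface.

/* Illustrative only: the names below are assumptions, not the real API. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Assumed layout: vq index in; worker pid in (sentinel) and out. */
struct vhost_vring_worker_example {
	unsigned int index;	/* virtqueue index */
	int pid;		/* in: VHOST_EXAMPLE_NEW_WORKER; out: worker pid */
};

/* Assumed sentinel and ioctl number (0xAF is the vhost ioctl magic). */
#define VHOST_EXAMPLE_NEW_WORKER	-1
#define VHOST_EXAMPLE_SET_VRING_WORKER \
	_IOWR(0xAF, 0x70, struct vhost_vring_worker_example)

/*
 * Ask the kernel to create one worker per virtqueue. With only the
 * default worker (created at VHOST_SET_OWNER time) every vq funnels
 * into a single thread; here each vq gets its own worker.
 */
int setup_worker_per_vq(int vhost_fd, unsigned int num_vqs)
{
	struct vhost_vring_worker_example w;
	unsigned int i;

	for (i = 0; i < num_vqs; i++) {
		w.index = i;
		w.pid = VHOST_EXAMPLE_NEW_WORKER;
		if (ioctl(vhost_fd, VHOST_EXAMPLE_SET_VRING_WORKER, &w) < 0)
			return -1;
		printf("vq %u handled by worker pid %d\n", i, w.pid);
	}
	return 0;
}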