On Fri, Oct 22, 2021 at 12:18:59AM -0500, Mike Christie wrote:
> The following patches apply over Linus's tree and this patchset
>
> https://lore.kernel.org/all/20211007214448.6282-1-michael.christie@xxxxxxxxxx/
>
> which allows us to check the vhost owner thread's RLIMITs:

Unfortunately that patchset in turn triggers kbuild warnings.
I was hoping you would address them; I don't think merging that
patchset before the kbuild issues are addressed is possible.

It also doesn't have many acks, so I'm a bit apprehensive about
merging core changes like this through the vhost tree.
Try to CC more widely/ping people?

> It looks like that patchset has been ok'd by all the major parties
> and just needs a small cleanup to apply to Jens's and Paul's trees, so I
> wanted to post my threading patches based on top of it for review.
>
> The following patches allow us to support multiple vhost workers per
> device. I ended up just doing Stefan's original idea where userspace has
> the kernel create a worker and we pass back the pid. This has the benefit
> over the workqueue and userspace-thread approaches that we only have
> one'ish code path in the kernel during setup to detect old tools. The
> main IO paths and the device/vq setup/teardown paths all use common code.
>
> I've also included a patch for qemu so you can get an idea of how it
> works. If we are ok with the kernel code then I'll break that up into
> a patchset and send it to qemu-devel for review.
>
> Results:
> --------
>
> fio jobs        1      2      4      8      12     16
> ----------------------------------------------------------
> 1 worker        84k    492k   510k   -      -      -
> worker per vq   184k   380k   744k   1422k  2256k  2434k
>
> Notes:
> 0. This used a simple fio command:
>
> fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
>    --ioengine=libaio --iodepth=128 --numjobs=$JOBS_ABOVE
>
> and I used a VM with 16 vCPUs and 16 virtqueues.
>
> 1. The patches were tested with emulate_pr=0 and these patches:
>
> https://lore.kernel.org/all/yq1tuhge4bg.fsf@xxxxxxxxxxxxxxxxxxxx/t/
>
> which are in mkp's scsi branches for the next kernel. They fix the perf
> issues where IOPs dropped at 12 vqs/jobs.
>
> 2. Because we have a hard limit of 1024 cmds, if num jobs * iodepth
> was greater than 1024, I decreased iodepth. So 12 jobs used an iodepth
> of 85 cmds, and 16 jobs used 64.
>
> 3. The perf issue above at 2 jobs is because when we only have 1 worker
> we execute more cmds per vhost_work, due to all vqs funneling to one worker.
> This results in fewer context switches and better batching without having to
> tweak any settings. I'm working on patches to add back batching during lio
> completion and to do polling on the submission side.
>
> We will still want the threading patches, because if we batch at the fio
> level and also use the vhost threading patches, we see a big boost like
> below. So hopefully doing it in the kernel will allow apps to just work
> without having to be smart like fio.
>
> fio using io_uring and batching with the iodepth_batch* settings:
>
> fio jobs        1      2      4      8      12     16
> -------------------------------------------------------------
> 1 worker        494k   520k   -      -      -      -
> worker per vq   496k   878k   1542k  2436k  2304k  2590k
>
> V3:
> - Fully convert the vhost code to use vq-based APIs instead of leaving it
>   half per dev and half per vq.
> - Rebase against the kernel worker API.
> - Drop delayed worker creation. We always create the default worker at
>   VHOST_SET_OWNER time. Userspace can create and bind workers after that.
>
> v2:
> - Changed the loop in which we take a refcount to the worker.
> - Replaced pid == -1 with a define.
> - Fixed a tabbing/spacing coding style issue.
> - Use a hash instead of a list to look up workers.
> - I dropped the patch that added an ioctl cmd to get a vq's worker's
>   pid. I saw we might do a generic netlink interface instead.
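
To make the "kernel creates a worker and passes back the pid" flow described
in the cover letter above easier to picture, here is a minimal userspace
sketch. It only relies on the existing VHOST_SET_OWNER ioctl from
<linux/vhost.h>; the EXAMPLE_VHOST_NEW_WORKER ioctl and struct
example_vring_worker below are illustrative placeholders for whatever the
patchset actually defines (which may also split worker creation and vq
binding into separate steps). See the patches and the qemu patch in this
series for the real uAPI.

/*
 * Sketch only: EXAMPLE_VHOST_NEW_WORKER and struct example_vring_worker
 * are placeholders, not the interface added by the patchset.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>        /* VHOST_VIRTIO, VHOST_SET_OWNER */

struct example_vring_worker {
	unsigned int index;     /* virtqueue index to serve */
	int pid;                /* filled in by the kernel */
};

/* Placeholder ioctl number; the real one comes from the patchset. */
#define EXAMPLE_VHOST_NEW_WORKER \
	_IOWR(VHOST_VIRTIO, 0x7f, struct example_vring_worker)

int main(void)
{
	int fd = open("/dev/vhost-scsi", O_RDWR);

	if (fd < 0) {
		perror("open /dev/vhost-scsi");
		return 1;
	}

	/* Creates the default worker thread (see the V3 notes above). */
	if (ioctl(fd, VHOST_SET_OWNER, NULL)) {
		perror("VHOST_SET_OWNER");
		return 1;
	}

	/* Ask for one extra worker per vq; an old kernel fails the ioctl. */
	for (unsigned int i = 0; i < 16; i++) {
		struct example_vring_worker w = { .index = i, .pid = -1 };

		if (ioctl(fd, EXAMPLE_VHOST_NEW_WORKER, &w)) {
			perror("EXAMPLE_VHOST_NEW_WORKER (old kernel?)");
			break;
		}
		printf("vq %u served by worker pid %d\n", i, w.pid);
	}
	return 0;
}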