Re: [RFC PATCH 0/8] vhost: allow userspace to control vq cpu affinity

Stefano Garzarella <sgarzare@xxxxxxxxxx> · Wed, 9 Dec 2020 16:58:53 +0100

Hi Mike,
sorry for the delay but there were holidays.

On Fri, Dec 04, 2020 at 11:33:11AM -0600, Mike Christie wrote:
On 12/4/20 11:10 AM, Mike Christie wrote:
On 12/4/20 10:06 AM, Stefano Garzarella wrote:
Hi Mike,

On Fri, Dec 04, 2020 at 01:56:25AM -0600, Mike Christie wrote:
These patches were made over mst's vhost branch.

The following patches, made over mst's vhost branch, allow userspace
to set each vq's cpu affinity. Currently, with cgroups the worker thread
inherits the affinity settings, but we are at the mercy of the CPU
scheduler for where the vq's IO will be executed. This can result in
the scheduler sometimes hammering a couple queues on the host instead of
spreading it out like how the guest's app might have intended if it was
mq aware.

This version of the patches is not what you guys were talking about
initially, i.e. an interface similar to nbd's old (3.x kernel days)
NBD_DO_IT ioctl, where userspace calls down to the kernel and we run
from that context. These patches instead just
allow userspace to tell the kernel which CPU a vq should run on.
We then use the kernel's workqueue code to handle the thread
management.

I agree that reusing kernel's workqueue code would be a good strategy.

One concern is how easy it is to implement an adaptive polling 
strategy using workqueues. From what I've seen, adding some 
polling of both backend and virtqueue helps to eliminate 
interrupts and reduce latency.

Would the polling you need to do be similar to the vhost net poll 
code like in vhost_net_busy_poll (different algorithm though)? But, 
we want to be able to poll multiple devs/vqs from the same CPU 
right? Something like:

retry:

for each poller on CPU N
    if poller has work
        driver->run work fn

if (poll limit hit)
    return
else
    cpu_relax();
goto retry;

?
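
As a rough illustration of the loop sketched above, here is a minimal
userspace C version. It is only a sketch under assumptions: struct poller,
poll_cpu(), cpu_relax_stub() and the demo_* helpers are hypothetical names
made up for this example and are not part of vhost or of this patch set.
The poll limit is modelled as a plain iteration count, whereas a real
implementation would more likely be time based.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical poller: one per vq/backend bound to a given CPU. */
struct poller {
	bool (*has_work)(void *ctx);	/* is there something to process? */
	void (*run_work)(void *ctx);	/* stands in for driver->run work fn */
	void *ctx;
};

/* Userspace stand-in for the kernel's cpu_relax(). */
static inline void cpu_relax_stub(void)
{
}

/*
 * Scan all pollers bound to one CPU, up to poll_limit passes.
 * Returns how many times a work function was run.
 */
static unsigned int poll_cpu(struct poller *pollers, size_t n,
			     unsigned int poll_limit)
{
	unsigned int ran = 0;

	assert(poll_limit > 0);

	for (unsigned int pass = 0; pass < poll_limit; pass++) {
		for (size_t i = 0; i < n; i++) {
			if (pollers[i].has_work(pollers[i].ctx)) {
				pollers[i].run_work(pollers[i].ctx);
				ran++;
			}
		}
		cpu_relax_stub();
	}
	return ran;
}

/* Demo backend: a counter of pending requests (illustration only). */
struct demo_queue {
	unsigned int pending;
	unsigned int done;
};

static bool demo_has_work(void *ctx)
{
	return ((struct demo_queue *)ctx)->pending > 0;
}

static void demo_run_work(void *ctx)
{
	struct demo_queue *q = ctx;

	q->pending--;
	q->done++;
}
```

Each pass handles at most one request per poller, so a busy queue cannot
keep the others from being scanned within the same pass.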

Yeah, something similar. IIUC vhost_net_busy_poll() polls both vring and 
backend (socket).

Maybe we need to limit the amount of work done in work->fn to avoid starvation.
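
One possible way to bound the work, sketched here in userspace C, is a
per-call budget in the spirit of NAPI. This is an assumption about how it
could be done, not what the patches do: struct budget_queue and
run_work_budgeted() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue state; not a vhost structure. */
struct budget_queue {
	unsigned int pending;	/* requests waiting to be processed */
	unsigned int completed;	/* requests processed so far */
};

/*
 * Process at most `budget` requests in one call.
 * Returns true if work remains, so the caller can requeue the work
 * item instead of monopolizing the CPU.
 */
static bool run_work_budgeted(struct budget_queue *q, unsigned int budget)
{
	assert(budget > 0);

	while (q->pending > 0 && budget > 0) {
		q->pending--;
		q->completed++;
		budget--;
	}
	return q->pending > 0;
}
```

A caller would requeue the work whenever the function returns true, which
lets other vqs bound to the same CPU run in between.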

If so, I had an idea for it. Let me send an additional patch on top 
of this set.

Sure :-)

Oh yeah, just to make sure I am on the same page for vdpa, because 
scsi and net work so differently.

Were you thinking that you would initially run from

vhost_poll_wakeup -> work->fn

then in the vdpa work->fn you would do the kick_vq still, but then 
also kick off a group backend/vq poller. This would then poll the 
vqs/devs that were bound to that CPU from the worker/wq thread.

Yes, this seems reasonable!

So I was thinking you want something similar to network's NAPI. Here 

I don't know NAPI very well, but IIUC the goal is the same: try to avoid
notifications (IRQs from the device, vm-exits from the guest) by doing
adaptive polling.

our work->fn is the hard irq, and then the worker is like their softirq 
we poll from.

I'm a little lost here...

Thanks,
Stefano