Re: [PATCH 13/14] vhost: allow userspace to create workers

On 5/16/23 10:10 PM, Jason Wang wrote:
> On Sat, Apr 29, 2023 at 12:32 AM Mike Christie
> <michael.christie@xxxxxxxxxx> wrote:
>>
>> For vhost-scsi with 3 vqs or more and a workload that tries to use
>> them in parallel like:
>>
>> fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
>> --ioengine=libaio --iodepth=128  --numjobs=3
>>
>> the single vhost worker thread will become a bottleneck and we are stuck
>> at around 500K IOPS no matter how many jobs, virtqueues, and CPUs are
>> used.
>>
>> To better utilize virtqueues and available CPUs, this patch allows
>> userspace to create workers and bind them to vqs. You can have N workers
>> per dev and also share N workers with M vqs on that dev.
>>
>> This patch adds the interface related code and the next patch will hook
>> vhost-scsi into it. The patches do not try to hook net and vsock into
>> the interface because:
>>
>> 1. multiple workers don't seem to help vsock. The problem is that with
>> only 2 virtqueues we never fully use the existing worker when doing
>> bidirectional tests. This seems to match vhost-scsi where we don't see
>> the worker as a bottleneck until 3 virtqueues are used.
>>
>> 2. net already has a way to use multiple workers.
>>
>> Signed-off-by: Mike Christie <michael.christie@xxxxxxxxxx>
>> ---
>>  drivers/vhost/vhost.c            | 145 ++++++++++++++++++++++++++++++-
>>  drivers/vhost/vhost.h            |   3 +
>>  include/uapi/linux/vhost.h       |  33 +++++++
>>  include/uapi/linux/vhost_types.h |  16 ++++
>>  4 files changed, 196 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 4b0b82292379..e8f829f35814 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -630,6 +630,80 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
>>         return NULL;
>>  }
>>
>> +/* Caller must have device mutex */
>> +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
>> +                                    struct vhost_worker *worker)
>> +{
>> +       if (vq->worker)
>> +               vq->worker->attachment_cnt--;
>> +       worker->attachment_cnt++;
>> +       vq->worker = worker;
>> +}
>> +
>> +/**
>> + * vhost_vq_attach_worker - set a virtqueue's worker from an ioctl command
>> + * @vq: the virtqueue we will set the worker for
>> + * @info: the worker userspace has requested us to use
>> + *
>> + * We only allow userspace to set a virtqueue's worker if it's not active and
>> + * polling is not enabled.
> 
> I wonder if we can mandate this in the code like check the vq backend
> in vhost_vq_work_queue().

I'll look into it. However, for the vsock case we actually do want to
queue the work even though there is no backend set yet. vsock just keeps
queueing works until VHOST_VSOCK_SET_RUNNING is executed. When that's
done, it runs the works that have been queueing up.
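
Something like the below is what I think you are suggesting (just a
rough sketch so we are talking about the same thing; it assumes a
helper that queues onto vq->worker, and it would break the vsock
behavior above unless vsock is converted first):

/* Rough sketch only: refuse to queue work on a vq with no backend.
 * This would break vsock today since it queues works before
 * VHOST_VSOCK_SET_RUNNING sets the backend.
 */
static bool vhost_vq_work_queue(struct vhost_virtqueue *vq,
				struct vhost_work *work)
{
	if (!vhost_vq_get_backend(vq))
		return false;

	/* Assumed helper that queues onto the vq's bound worker. */
	return vhost_worker_queue(vq->worker, work);
}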


> 
>> + * polling is not enabled. We also assume drivers supporting this will not be
>> + * internally queueing works directly or via calls like vhost_dev_flush at
>> + * this time.
>> + *
>> + * Caller must have device and virtqueue mutex.
>> + */
>> +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
>> +                                 struct vhost_vring_worker *info)
>> +{
>> +       unsigned long index = info->worker_id;
>> +       struct vhost_dev *dev = vq->dev;
>> +       struct vhost_worker *worker;
>> +
>> +       if (!dev->use_worker)
>> +               return -EINVAL;
>> +
>> +       if (vhost_vq_get_backend(vq) || vq->kick)
> 
> It might be worthwhile to have a comment to explain why we need to
> check vq->kick here.
> 
> This also means the device should not queue work when the backend is NULL.
> 
> But I found it is probably not the case for vsock, it calls
> vhost_poll_queue() in vhost_transport_cancel_pkt() but
> vhost_vsock_stop() doesn't wait before doing vhost_vq_set_backend(vq,
> NULL);

Yeah, there was another case where vhost_transport_send_pkt can call
vhost_work_queue before the backend is set.

I ended up doing the opt-in method though. I did a test converting
vsock, and using one worker for the recv queue and one for the xmit
queue doesn't help. It was like vhost-scsi, where with just 2 virtqueues
one worker can handle the load. So I thought there was no point in
adding extra code for vsock if it wasn't going to be used.


> 
> Net seems to be fine since it waits for ubufs to be completed in
> vhost_net_set_backend().
> 
> Can we make things easier by migrating the work_list? I also worry if
> there are other corner cases which makes me think how hard it is if we
> can just support those ioctls after the backend is set?


It's not too difficult to support attaching/detaching workers from
virtqueues while IO is running. One of the versions I posted used RCU.
However, it adds complexity, and I didn't have a use case, so I didn't
want to add code that most likely wouldn't be used.
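
For reference, the RCU version looked roughly like the below (a sketch
from memory, not the exact code that was posted; it assumes vq->worker
is __rcu annotated and that there is a helper that queues onto a
specific worker):

/* Sketch of the RCU approach (not the posted code). The worker is
 * swapped under the device mutex, and we wait for in-flight queuers
 * before the old worker can be flushed or freed.
 */
static void vhost_vq_swap_worker(struct vhost_virtqueue *vq,
				 struct vhost_worker *worker)
{
	struct vhost_worker *old;

	old = rcu_dereference_protected(vq->worker,
				lockdep_is_held(&vq->dev->mutex));
	rcu_assign_pointer(vq->worker, worker);
	synchronize_rcu();

	if (old)
		old->attachment_cnt--;
	worker->attachment_cnt++;
}

static bool vhost_vq_work_queue(struct vhost_virtqueue *vq,
				struct vhost_work *work)
{
	struct vhost_worker *worker;
	bool queued = false;

	rcu_read_lock();
	worker = rcu_dereference(vq->worker);
	if (worker)
		queued = vhost_worker_queue(worker, work);
	rcu_read_unlock();

	return queued;
}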


>>  void vhost_clear_msg(struct vhost_dev *dev);
>> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
>> index 92e1b700b51c..155710585979 100644
>> --- a/include/uapi/linux/vhost.h
>> +++ b/include/uapi/linux/vhost.h
>> @@ -45,6 +45,25 @@
>>  #define VHOST_SET_LOG_BASE _IOW(VHOST_VIRTIO, 0x04, __u64)
>>  /* Specify an eventfd file descriptor to signal on log write. */
>>  #define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
>> +/* By default, a device gets one vhost_worker that its virtqueues share. This
>> + * command allows the owner of the device to create an additional vhost_worker
>> + * for the device. It can later be bound to 1 or more of its virtqueues using
>> + * the VHOST_ATTACH_VRING_WORKER command.
>> + *
>> + * This must be called after VHOST_SET_OWNER and the caller must be the owner
>> + * of the device. The new thread will inherit caller's cgroups and namespaces,
>> + * and will share the caller's memory space. The new thread will also be
>> + * counted against the caller's RLIMIT_NPROC value.
> 
> This makes me think if we should destroy and re-create those after
> VHOST_RESET_OWNER?

Do you mean what happens if we do something like destroy the cgroup after
VHOST_RESET_OWNER? You have to move the threads in the cgroup to another
cgroup before destroying it, so it depends on what's going on at the time.

For example, if you are asking about a racy kind of case where we have moved
the parent/qemu thread to a new cgroup and then do VHOST_NEW_WORKER, the new
thread will end up in the new cgroup with the parent. The existing threads
might still be in the old cgroup until you have finished moving them.
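
As an aside, since the flow has come up a few times in this thread, the
expected usage from userspace is roughly the below (a simplified sketch:
error handling is trimmed, and the state struct name and vring index
field are written from memory, so check the uapi hunks for the real
layouts):

/* Userspace sketch: create an extra worker and bind it to vring 2. */
#include <sys/ioctl.h>
#include <linux/vhost.h>

int bind_new_worker(int vhost_fd)
{
	struct vhost_worker_state state = {};
	struct vhost_vring_worker ring_worker = {};

	/* Must be done after VHOST_SET_OWNER by the owner. */
	if (ioctl(vhost_fd, VHOST_NEW_WORKER, &state) < 0)
		return -1;

	ring_worker.index = 2;			/* vring to bind */
	ring_worker.worker_id = state.worker_id;

	/* Only allowed while the vq has no backend and no kick fd. */
	return ioctl(vhost_fd, VHOST_ATTACH_VRING_WORKER, &ring_worker);
}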

