On Tue, Aug 2, 2022 at 10:12 PM Guo Zhi <qtxuning1999@xxxxxxxxxxx> wrote:
>
>
> ----- Original Message -----
> > From: "jasowang" <jasowang@xxxxxxxxxx>
> > To: "Guo Zhi" <qtxuning1999@xxxxxxxxxxx>
> > Cc: "eperezma" <eperezma@xxxxxxxxxx>, "sgarzare" <sgarzare@xxxxxxxxxx>,
> > "Michael Tsirkin" <mst@xxxxxxxxxx>, "netdev" <netdev@xxxxxxxxxxxxxxx>,
> > "linux-kernel" <linux-kernel@xxxxxxxxxxxxxxx>, "kvm list" <kvm@xxxxxxxxxxxxxxx>,
> > "virtualization" <virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx>
> > Sent: Friday, July 29, 2022 3:32:02 PM
> > Subject: Re: [RFC 1/5] vhost: reorder used descriptors in a batch
>
> > On Thu, Jul 28, 2022 at 4:26 PM Guo Zhi <qtxuning1999@xxxxxxxxxxx> wrote:
> >>
> >> On 2022/7/26 15:36, Jason Wang wrote:
> >>
> >> On 2022/7/21 16:43, Guo Zhi wrote:
> >>
> >> Devices may not use descriptors in order; for example, NIC and SCSI may
> >> not call __vhost_add_used_n with buffers in order. It's the task of
> >> __vhost_add_used_n to order them.
> >>
> >> I'm not sure this is true. Having out-of-order descriptors is probably
> >> by design, to get better performance.
> >>
> >> This might be obvious for a device that has elevator or QoS stuff.
> >>
> >> I suspect the right thing to do here is: for the devices that can't
> >> perform better in the case of IN_ORDER, let's simply not offer IN_ORDER
> >> (zerocopy or scsi). And for the devices we know can perform better, e.g.
> >> the non-zerocopy ethernet device, we can do that.
> >>
> >> This commit reorders the buffers using vq->heads; only when the batch
> >> begins at the expected start point and is contiguous can the batch be
> >> exposed to the driver. And it writes out only a single used ring entry
> >> for a batch of descriptors, according to the VIRTIO 1.1 spec.
> >>
> >> So this sounds more like a "workaround" for a device that can't consume
> >> buffers in order, and I suspect it can help in performance.
> >>
> >> More below.
> >>
> >> Signed-off-by: Guo Zhi <qtxuning1999@xxxxxxxxxxx>
> >> ---
> >>  drivers/vhost/vhost.c | 44 +++++++++++++++++++++++++++++++++++++++++--
> >>  drivers/vhost/vhost.h |  3 +++
> >>  2 files changed, 45 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >> index 40097826c..e2e77e29f 100644
> >> --- a/drivers/vhost/vhost.c
> >> +++ b/drivers/vhost/vhost.c
> >> @@ -317,6 +317,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> >>          vq->used_flags = 0;
> >>          vq->log_used = false;
> >>          vq->log_addr = -1ull;
> >> +        vq->next_used_head_idx = 0;
> >>          vq->private_data = NULL;
> >>          vq->acked_features = 0;
> >>          vq->acked_backend_features = 0;
> >> @@ -398,6 +399,8 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> >>                                            GFP_KERNEL);
> >>          if (!vq->indirect || !vq->log || !vq->heads)
> >>                  goto err_nomem;
> >> +
> >> +        memset(vq->heads, 0, sizeof(*vq->heads) * dev->iov_limit);
> >>  }
> >>  return 0;
> >> @@ -2374,12 +2377,49 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
> >>                              unsigned count)
> >>  {
> >>          vring_used_elem_t __user *used;
> >> +        struct vring_desc desc;
> >>          u16 old, new;
> >>          int start;
> >> +        int begin, end, i;
> >> +        int copy_n = count;
> >> +
> >> +        if (vhost_has_feature(vq, VIRTIO_F_IN_ORDER)) {
> >>
> >> How do you guarantee that the ids of heads are contiguous?
> >>
> >> There is no need for the ids of heads to be contiguous.
> >>
> >> For example, I have three buffers {.id = 0, 15}, {.id = 20, 30} and
> >> {.id = 15, 20} for vhost_add_used_n. Then I will set vq->heads[0].len = 15,
> >> vq->heads[15].len = 5 and vq->heads[20].len = 10 as the reorder. Once I
> >> find there is no hole in the batched descriptors, I will expose them to
> >> the driver.
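To make sure we're talking about the same thing, here is an untested
userspace toy of the bookkeeping you describe (VQ_NUM, mark_used() and
flush_batch() are illustrative names of mine, not anything in the patch):

#include <stdio.h>

#define VQ_NUM 32

static unsigned short heads_len[VQ_NUM];   /* 0 means "not used yet" */
static unsigned short next_used_head_idx;  /* next head we expect back */

/* A chain of chain_len descriptors headed at id was used by the device. */
static void mark_used(unsigned short id, unsigned short chain_len)
{
	heads_len[id] = chain_len;
}

/* Expose every chain in the contiguous run starting at the expected head. */
static unsigned int flush_batch(void)
{
	unsigned int count = 0;

	while (heads_len[next_used_head_idx] != 0) {
		unsigned short i = next_used_head_idx;

		next_used_head_idx = (i + heads_len[i]) % VQ_NUM;
		heads_len[i] = 0;
		count++;
	}
	return count;
}

int main(void)
{
	mark_used(0, 15);              /* chain 0..14 */
	printf("%u\n", flush_batch()); /* 1: contiguous from 0 */
	mark_used(20, 10);             /* chain 20..29 */
	printf("%u\n", flush_batch()); /* 0: hole at 15..19 */
	mark_used(15, 5);              /* chain 15..19 fills the hole */
	printf("%u\n", flush_batch()); /* 2: chains at 15 and 20 */
	return 0;
}

If that matches your intent, note that the device here completes chains
out of order, which is what the spec text below forbids.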
> > So the spec said:
> >
> > "If VIRTIO_F_IN_ORDER has been negotiated, driver uses descriptors in
> > ring order: starting from offset 0 in the table, and wrapping around
> > at the end of the table."
> >
> > And:
> >
> > "VIRTIO_F_IN_ORDER(35) This feature indicates that all buffers are used
> > by the device in the same order in which they have been made
> > available."
> >
> > This means your example is not an IN_ORDER device.
> >
> > The driver should submit buffers (assuming each buffer has one
> > descriptor) in order {id = 0, 15}, {id = 1, 30} and {id = 2, 20}.
> >
> > And even if they are submitted in order, we cannot use a batch, because:
> >
> > "The skipped buffers (for which no used ring entry was written) are
> > assumed to have been used (read or written) by the device completely."
> >
> > This means for TX we are probably OK, but for RX, unless we know the
> > buffers were written completely, we can't write them in a batch.
> >
> > I'd suggest doing cross testing for this series:
> >
> > 1) test vhost IN_ORDER support with the DPDK virtio PMD
> > 2) test virtio IN_ORDER with DPDK vhost-user via testpmd
> >
> > Thanks
>
> You are correct: for rx we can't do a batch, because we have to let the
> driver know the length of the buffers.

Note that we can do a batch for rx when we know all the buffers have
been fully written.

> I think these circumstances can offer a batch:
> 1. tx
> 2. rx with the RX_MRGBUF feature, which introduces a header for each
>    received buffer
>
> Considering that batching is not a mandatory requirement for the in
> order feature according to the spec, I'd like to let the current RFC
> patch focus on the in order implementation, and send another patch
> series to improve performance by batching in the above circumstances.

That's fine. How about simply starting from the patch that offers
IN_ORDER when zerocopy is disabled?

Thanks

> What's your opinion?
>
> Thanks
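For the record, the single-entry write-out the spec text above allows
boils down to something like the following toy userspace model. Only the
used element layout mirrors the split ring; device_complete_batch() and
driver_reap() are made-up names for illustration:

#include <stdint.h>
#include <stdio.h>

#define VQ_NUM 8

struct vring_used_elem { uint32_t id; uint32_t len; };

static struct vring_used_elem used_ring[VQ_NUM];
static uint16_t used_idx;       /* device side: total buffers used */
static uint16_t last_used_idx;  /* driver side: buffers reaped so far */

/* Device: write one used element naming the head of the *last* chain in
 * the batch, then skip forward by the whole batch size.
 */
static void device_complete_batch(uint32_t last_head_id, uint16_t count)
{
	used_ring[used_idx % VQ_NUM].id = last_head_id;
	used_ring[used_idx % VQ_NUM].len = 0; /* OK for tx; rx needs real lengths */
	used_idx = (uint16_t)(used_idx + count);
}

/* Driver: with IN_ORDER the skipped buffers are assumed fully used, so
 * the single entry is enough to reap the whole batch.
 */
static void driver_reap(void)
{
	uint16_t n = (uint16_t)(used_idx - last_used_idx);

	if (n)
		printf("reaped %u buffers, last head id %u\n", (unsigned int)n,
		       (unsigned int)used_ring[last_used_idx % VQ_NUM].id);
	last_used_idx = used_idx;
}

int main(void)
{
	/* Three single-descriptor buffers made available with heads 0, 1, 2. */
	device_complete_batch(2, 3);
	driver_reap(); /* prints: reaped 3 buffers, last head id 2 */
	return 0;
}

The zeroed len in the used element is exactly why rx can't batch unless
the buffers are known to be fully written (or RX_MRGBUF carries the
length elsewhere).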
> >> +        /* calculate descriptor chain length for each used buffer */
> >>
> >> I'm a little bit confused by this comment; we have heads[i].len for this?
> >>
> >> Maybe I should not use vq->heads; it is somewhat misleading.
> >>
> >> +        for (i = 0; i < count; i++) {
> >> +                begin = heads[i].id;
> >> +                end = begin;
> >> +                vq->heads[begin].len = 0;
> >>
> >> Does this work for e.g. an RX virtqueue?
> >>
> >> +                do {
> >> +                        vq->heads[begin].len += 1;
> >> +                        if (unlikely(vhost_get_desc(vq, &desc, end))) {
> >>
> >> Let's try hard to avoid more userspace copies here; they are a source
> >> of performance regressions.
> >>
> >> Thanks
> >>
> >> +                                vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> >> +                                       end, vq->desc + end);
> >> +                                return -EFAULT;
> >> +                        }
> >> +                } while ((end = next_desc(vq, &desc)) != -1);
> >> +        }
> >> +
> >> +        count = 0;
> >> +        /* sort and batch continuous used ring entries */
> >> +        while (vq->heads[vq->next_used_head_idx].len != 0) {
> >> +                count++;
> >> +                i = vq->next_used_head_idx;
> >> +                vq->next_used_head_idx = (vq->next_used_head_idx +
> >> +                                          vq->heads[vq->next_used_head_idx].len)
> >> +                                         % vq->num;
> >> +                vq->heads[i].len = 0;
> >> +        }
> >> +        /* only write out a single used ring entry with the id corresponding
> >> +         * to the head entry of the descriptor chain describing the last buffer
> >> +         * in the batch.
> >> +         */
> >> +        heads[0].id = i;
> >> +        copy_n = 1;
> >> +        }
> >>          start = vq->last_used_idx & (vq->num - 1);
> >>          used = vq->used->ring + start;
> >> -        if (vhost_put_used(vq, heads, start, count)) {
> >> +        if (vhost_put_used(vq, heads, start, copy_n)) {
> >>                  vq_err(vq, "Failed to write used");
> >>                  return -EFAULT;
> >>          }
> >> @@ -2410,7 +2450,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
> >>          start = vq->last_used_idx & (vq->num - 1);
> >>          n = vq->num - start;
> >> -        if (n < count) {
> >> +        if (n < count && !vhost_has_feature(vq, VIRTIO_F_IN_ORDER)) {
> >>                  r = __vhost_add_used_n(vq, heads, n);
> >>                  if (r < 0)
> >>                          return r;
> >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> >> index d9109107a..7b2c0fbb5 100644
> >> --- a/drivers/vhost/vhost.h
> >> +++ b/drivers/vhost/vhost.h
> >> @@ -107,6 +107,9 @@ struct vhost_virtqueue {
> >>          bool log_used;
> >>          u64 log_addr;
> >> +        /* Sort heads in order */
> >> +        u16 next_used_head_idx;
> >> +
> >>          struct iovec iov[UIO_MAXIOV];
> >>          struct iovec iotlb_iov[64];
> >>          struct iovec *indirect;
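On the userspace-copy point above: the chain walk the patch does with
vhost_get_desc()/next_desc() is conceptually just the following (an
illustrative userspace model; DESC_F_NEXT and chain_len() are stand-ins
for VRING_DESC_F_NEXT and the vhost helpers), and each iteration costs a
read of guest memory, which is what I'd like to see avoided:

#include <stdint.h>
#include <stdio.h>

#define DESC_F_NEXT 1 /* stand-in for VRING_DESC_F_NEXT */

struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Follow the next pointers from a chain's head and count its descriptors. */
static unsigned int chain_len(const struct desc *table, uint16_t head)
{
	unsigned int n = 1;

	while (table[head].flags & DESC_F_NEXT) {
		head = table[head].next;
		n++;
	}
	return n;
}

int main(void)
{
	struct desc table[3] = {
		{ .flags = DESC_F_NEXT, .next = 1 },
		{ .flags = DESC_F_NEXT, .next = 2 },
		{ 0 }, /* last descriptor: no NEXT flag */
	};

	printf("chain headed at 0 has %u descriptors\n", chain_len(table, 0));
	return 0;
}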