----- Original Message -----
> From: "jasowang" <jasowang@xxxxxxxxxx>
> To: "Guo Zhi" <qtxuning1999@xxxxxxxxxxx>
> Cc: "eperezma" <eperezma@xxxxxxxxxx>, "sgarzare" <sgarzare@xxxxxxxxxx>, "Michael Tsirkin" <mst@xxxxxxxxxx>, "netdev" <netdev@xxxxxxxxxxxxxxx>, "linux-kernel" <linux-kernel@xxxxxxxxxxxxxxx>, "kvm list" <kvm@xxxxxxxxxxxxxxx>, "virtualization" <virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx>
> Sent: Thursday, August 4, 2022 1:04:16 PM
> Subject: Re: [RFC 1/5] vhost: reorder used descriptors in a batch

> On Tue, Aug 2, 2022 at 10:12 PM Guo Zhi <qtxuning1999@xxxxxxxxxxx> wrote:
>>
>> ----- Original Message -----
>> > From: "jasowang" <jasowang@xxxxxxxxxx>
>> > To: "Guo Zhi" <qtxuning1999@xxxxxxxxxxx>
>> > Cc: "eperezma" <eperezma@xxxxxxxxxx>, "sgarzare" <sgarzare@xxxxxxxxxx>, "Michael Tsirkin" <mst@xxxxxxxxxx>, "netdev" <netdev@xxxxxxxxxxxxxxx>, "linux-kernel" <linux-kernel@xxxxxxxxxxxxxxx>, "kvm list" <kvm@xxxxxxxxxxxxxxx>, "virtualization" <virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx>
>> > Sent: Friday, July 29, 2022 3:32:02 PM
>> > Subject: Re: [RFC 1/5] vhost: reorder used descriptors in a batch
>>
>> > On Thu, Jul 28, 2022 at 4:26 PM Guo Zhi <qtxuning1999@xxxxxxxxxxx> wrote:
>> >>
>> >> On 2022/7/26 15:36, Jason Wang wrote:
>> >>
>> >> On 2022/7/21 16:43, Guo Zhi wrote:
>> >>
>> >> Devices may not use descriptors in order; for example, NIC and SCSI may
>> >> not call __vhost_add_used_n with buffers in order. It's the task of
>> >> __vhost_add_used_n to order them.
>> >>
>> >> I'm not sure this is true. Having OOO descriptors is probably by design,
>> >> to get better performance.
>> >>
>> >> This might be obvious for devices that have elevator or QoS stuff.
>> >>
>> >> I suspect the right thing to do here is: for the devices that can't perform
>> >> better in the case of IN_ORDER, let's simply not offer IN_ORDER (zerocopy or
>> >> scsi). And for the devices we know can perform better, the non-zerocopy
>> >> ethernet device, we can do that.
>> >>
>> >> This commit reorders the buffers using vq->heads; only when the batch
>> >> begins from the expected start point and is continuous can the batch be
>> >> exposed to the driver. It also writes out only a single used ring entry
>> >> for a batch of descriptors, according to the VIRTIO 1.1 spec.
>> >>
>> >> So this sounds more like a "workaround" for a device that can't consume
>> >> buffers in order; I suspect it can help in performance.
>> >>
>> >> More below.
>> >>
>> >> Signed-off-by: Guo Zhi <qtxuning1999@xxxxxxxxxxx>
>> >> ---
>> >>  drivers/vhost/vhost.c | 44 +++++++++++++++++++++++++++++++++++++++++--
>> >>  drivers/vhost/vhost.h |  3 +++
>> >>  2 files changed, 45 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> >> index 40097826c..e2e77e29f 100644
>> >> --- a/drivers/vhost/vhost.c
>> >> +++ b/drivers/vhost/vhost.c
>> >> @@ -317,6 +317,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>> >>         vq->used_flags = 0;
>> >>         vq->log_used = false;
>> >>         vq->log_addr = -1ull;
>> >> +       vq->next_used_head_idx = 0;
>> >>         vq->private_data = NULL;
>> >>         vq->acked_features = 0;
>> >>         vq->acked_backend_features = 0;
>> >> @@ -398,6 +399,8 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
>> >>                                   GFP_KERNEL);
>> >>                 if (!vq->indirect || !vq->log || !vq->heads)
>> >>                         goto err_nomem;
>> >> +
>> >> +               memset(vq->heads, 0, sizeof(*vq->heads) * dev->iov_limit);
>> >>         }
>> >>         return 0;
>> >>
>> >> @@ -2374,12 +2377,49 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>> >>                             unsigned count)
>> >>  {
>> >>         vring_used_elem_t __user *used;
>> >> +       struct vring_desc desc;
>> >>         u16 old, new;
>> >>         int start;
>> >> +       int begin, end, i;
>> >> +       int copy_n = count;
>> >> +
>> >> +       if (vhost_has_feature(vq, VIRTIO_F_IN_ORDER)) {
>> >>
>> >> How do you guarantee that the ids of the heads are contiguous?
>> >>
>> >> There is no need for the ids of the heads to be contiguous.
>> >>
>> >> For example, I have three buffers {.id = 0, 15}, {.id = 20, 30}, {.id = 15, 20}
>> >> for vhost_add_used_n. Then I will set vq->heads[0].len = 15,
>> >> vq->heads[15].len = 5, vq->heads[20].len = 10 as the reorder. Once I find there
>> >> is no hole in the batched descriptors, I will expose them to the driver.
>> >
>> > So the spec says:
>> >
>> > "If VIRTIO_F_IN_ORDER has been negotiated, driver uses descriptors in
>> > ring order: starting from offset 0 in the table, and wrapping around
>> > at the end of the table."
>> >
>> > And
>> >
>> > "VIRTIO_F_IN_ORDER(35) This feature indicates that all buffers are used
>> > by the device in the same order in which they have been made
>> > available."
>> >
>> > This means your example is not an IN_ORDER device.
>> >
>> > The driver should submit buffers (assuming each buffer has one
>> > descriptor) in order {id = 0, 15}, {id = 1, 30} and {id = 2, 20}.
>> >
>> > And even if they are submitted in order, we can not use a batch because:
>> >
>> > "The skipped buffers (for which no used ring entry was written) are
>> > assumed to have been used (read or written) by the device completely."
>> >
>> > This means for TX we are probably OK, but for RX, unless we know the
>> > buffers were written completely, we can't write them in a batch.
>> >
>> > I'd suggest doing cross testing for this series:
>> >
>> > 1) testing vhost IN_ORDER support with the DPDK virtio PMD
>> > 2) testing virtio IN_ORDER with DPDK vhost-user via testpmd
>> >
>> > Thanks
>> >
>> You are correct, for RX we can't do a batch because we have to let the driver
>> know the length of the buffers.
>
> Note that we can do a batch for RX when we know all the buffers have
> been fully written.
>
>> I think these circumstances can offer a batch:
>> 1. TX
>> 2. RX with the MRG_RXBUF feature, which introduces a header for each received buffer
>>
>> Consider that batching is not a mandatory requirement for the in order feature
>> according to the spec.
>> I'd like to let the current RFC patch focus on the in order implementation, and
>> send another patch series to improve performance by batching in the above
>> circumstances.
>
> That's fine, how about simply starting from the patch that offers
> IN_ORDER when zerocopy is disabled?
>

Yeah, I'd like to start from the vsock device, which doesn't use zerocopy.

Thanks

> Thanks
>
>> What's your opinion?
>>
>> Thanks
>>
>> >>
>> >> +               /* calculate descriptor chain length for each used buffer */
>> >>
>> >> I'm a little bit confused about this comment; don't we have heads[i].len for this?
>> >>
>> >> Maybe I should not use vq->heads; it is somewhat misleading.
>> >>
>> >> +               for (i = 0; i < count; i++) {
>> >> +                       begin = heads[i].id;
>> >> +                       end = begin;
>> >> +                       vq->heads[begin].len = 0;
>> >>
>> >> Does this work for e.g. an RX virtqueue?
>> >>
>> >> +                       do {
>> >> +                               vq->heads[begin].len += 1;
>> >> +                               if (unlikely(vhost_get_desc(vq, &desc, end))) {
>> >>
>> >> Let's try hard to avoid more userspace copies here; they are a source of
>> >> performance regressions.
>> >>
>> >> Thanks
>> >>
>> >> +                                       vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
>> >> +                                              end, vq->desc + end);
>> >> +                                       return -EFAULT;
>> >> +                               }
>> >> +                       } while ((end = next_desc(vq, &desc)) != -1);
>> >> +               }
>> >> +
>> >> +               count = 0;
>> >> +               /* sort and batch continuous used ring entry */
>> >> +               while (vq->heads[vq->next_used_head_idx].len != 0) {
>> >> +                       count++;
>> >> +                       i = vq->next_used_head_idx;
>> >> +                       vq->next_used_head_idx = (vq->next_used_head_idx +
>> >> +                                                 vq->heads[vq->next_used_head_idx].len)
>> >> +                                                 % vq->num;
>> >> +                       vq->heads[i].len = 0;
>> >> +               }
>> >> +               /* only write out a single used ring entry with the id corresponding
>> >> +                * to the head entry of the descriptor chain describing the last buffer
>> >> +                * in the batch.
>> >> +                */
>> >> +               heads[0].id = i;
>> >> +               copy_n = 1;
>> >> +       }
>> >>         start = vq->last_used_idx & (vq->num - 1);
>> >>         used = vq->used->ring + start;
>> >> -       if (vhost_put_used(vq, heads, start, count)) {
>> >> +       if (vhost_put_used(vq, heads, start, copy_n)) {
>> >>                 vq_err(vq, "Failed to write used");
>> >>                 return -EFAULT;
>> >>         }
>> >> @@ -2410,7 +2450,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
>> >>         start = vq->last_used_idx & (vq->num - 1);
>> >>         n = vq->num - start;
>> >> -       if (n < count) {
>> >> +       if (n < count && !vhost_has_feature(vq, VIRTIO_F_IN_ORDER)) {
>> >>                 r = __vhost_add_used_n(vq, heads, n);
>> >>                 if (r < 0)
>> >>                         return r;
>> >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> >> index d9109107a..7b2c0fbb5 100644
>> >> --- a/drivers/vhost/vhost.h
>> >> +++ b/drivers/vhost/vhost.h
>> >> @@ -107,6 +107,9 @@ struct vhost_virtqueue {
>> >>         bool log_used;
>> >>         u64 log_addr;
>> >> +       /* Sort heads in order */
>> >> +       u16 next_used_head_idx;
>> >> +
>> >>         struct iovec iov[UIO_MAXIOV];
>> >>         struct iovec iotlb_iov[64];
>> >>         struct iovec *indirect;
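
[To make the reorder-and-batch idea under discussion easier to follow, here is a minimal, self-contained userspace sketch. It is not the kernel patch above; the names chain_len, next_used_head and add_used_in_order are invented for illustration only. Completions are recorded per head id with their descriptor-chain lengths, and only the contiguous run starting at the next expected head is retired, reporting a single id (the last buffer in the run) the way a single used ring entry would under VIRTIO_F_IN_ORDER.]

#include <stdio.h>
#include <stdint.h>

#define QUEUE_SIZE 64

/* Number of descriptors in the chain whose head sits at this index,
 * or 0 if that chain has not been completed yet. */
static uint16_t chain_len[QUEUE_SIZE];
/* Next head index we expect to retire, in available-ring order. */
static uint16_t next_used_head;

/* A completed buffer: head descriptor index plus chain length. */
struct completion {
        uint16_t head;
        uint16_t num_descs;
};

/* Record possibly out-of-order completions, then retire the contiguous run
 * starting at next_used_head.  Returns the head id of the last buffer in
 * the retired batch (the single id that would go into the used ring), or
 * -1 if the expected next buffer is still outstanding. */
static int add_used_in_order(const struct completion *c, int n)
{
        int last = -1;

        for (int i = 0; i < n; i++)
                chain_len[c[i].head] = c[i].num_descs;

        while (chain_len[next_used_head] != 0) {
                uint16_t len = chain_len[next_used_head];

                last = next_used_head;
                chain_len[next_used_head] = 0;
                next_used_head = (next_used_head + len) % QUEUE_SIZE;
        }
        return last;
}

int main(void)
{
        /* Chains made available at heads 0 (3 descs), 3 (2 descs), 5 (1 desc),
         * but completed out of order. */
        struct completion ooo[] = { { .head = 3, .num_descs = 2 } };
        struct completion rest[] = {
                { .head = 5, .num_descs = 1 },
                { .head = 0, .num_descs = 3 },
        };

        /* Head 0 is still outstanding, so nothing can be exposed yet. */
        printf("batch ends at head %d\n", add_used_in_order(ooo, 1));
        /* Once head 0 completes, heads 0, 3 and 5 form one contiguous batch;
         * only the last head (5) would be written to the used ring. */
        printf("batch ends at head %d\n", add_used_in_order(rest, 2));
        return 0;
}

[Built with any C compiler, the two printf calls report -1 (hole at head 0, batch withheld) and then 5 (heads 0, 3 and 5 retired as one batch).]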