On Tue, Apr 09, 2024 at 01:02:52AM +0800, ni.liqiang wrote: > In our testing of the virtio hardware accelerator, we found that > configuring the flags of the descriptor after addr and len, > as implemented in DPDK, seems to be more friendly to the hardware. > > In our Virtio hardware implementation tests, using the default > open-source code, the hardware's bulk reads ensure performance > but correctness is compromised. If we refer to the implementation code > of DPDK, placing the flags configuration of the descriptor > after addr and len, virtio backend can function properly based on > our hardware accelerator. > > I am somewhat puzzled by this. From a software process perspective, > it seems that there should be no difference whether > the flags configuration of the descriptor is before or after addr and len. > However, this is not the case according to experimental test results. You should be aware of the following, from the PCI Express spec. Note especially the second paragraph, and the last paragraph: 2.4.2. 25 30 Update Ordering and Granularity Observed by a Read Transaction If a Requester using a single transaction reads a block of data from a Completer, and the Completer's data buffer is concurrently being updated, the ordering of multiple updates and granularity of each update reflected in the data returned by the read is outside the scope of this specification. This applies both to updates performed by PCI Express write transactions and updates performed by other mechanisms such as host CPUs updating host memory. If a Requester using a single transaction reads a block of data from a Completer, and the Completer's data buffer is concurrently being updated by one or more entities not on the PCI Express fabric, the ordering of multiple updates and granularity of each update reflected in the data returned by the read is outside the scope of this specification. As an example of update ordering, assume that the block of data is in host memory, and a host CPU writes first to location A and then to a different location B. A Requester reading that data block with a single read transaction is not guaranteed to observe those updates in order. In other words, the Requester may observe an updated value in location B and an old value in location A, regardless of the placement of locations A and B within the data block. Unless a Completer makes its own guarantees (outside this specification) with respect to update ordering, a Requester that relies on update ordering must observe the update to location B via one read transaction before initiating a subsequent read to location A to return its updated value. As an example of update granularity, if a host CPU writes a QWORD to host memory, a Requester reading that QWORD from host memory may observe a portion of the QWORD updated and another portion of it containing the old value. While not required by this specification, it is strongly recommended that host platforms guarantee that when a host CPU writes aligned DWORDs or aligned QWORDs to host memory, the update granularity observed by a PCI Express read will not be smaller than a DWORD. IMPLEMENTATION NOTE No Ordering Required Between Cachelines 15 A Root Complex serving as a Completer to a single Memory Read that requests multiple cachelines from host memory is permitted to fetch multiple cachelines concurrently, to help facilitate multi- cacheline completions, subject to Max_Payload_Size. No ordering relationship between these cacheline fetches is required. Now I suspect that what is going on is that your Root complex reads descriptors out of order, so the second descriptor is invalid but the 1st one is valid. > We would like to know if such a change in the configuration order > is reasonable and acceptable? We need to understand the root cause and how robust the fix is before answering this. > Thanks. > > Signed-off-by: ni.liqiang <ni_liqiang@xxxxxxx> > Reviewed-by: jin.qi <jin.qi@xxxxxxxxxx> > Tested-by: jin.qi <jin.qi@xxxxxxxxxx> > Cc: ni.liqiang <ni.liqiang@xxxxxxxxxx> > --- > drivers/virtio/virtio_ring.c | 9 +++++---- > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 6f7e5010a673..bea2c2fb084e 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -1472,15 +1472,16 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq, > flags = cpu_to_le16(vq->packed.avail_used_flags | > (++c == total_sg ? 0 : VRING_DESC_F_NEXT) | > (n < out_sgs ? 0 : VRING_DESC_F_WRITE)); > - if (i == head) > - head_flags = flags; > - else > - desc[i].flags = flags; > > desc[i].addr = cpu_to_le64(addr); > desc[i].len = cpu_to_le32(sg->length); > desc[i].id = cpu_to_le16(id); > > + if (i == head) > + head_flags = flags; > + else > + desc[i].flags = flags; > + > if (unlikely(vq->use_dma_api)) { > vq->packed.desc_extra[curr].addr = addr; > vq->packed.desc_extra[curr].len = sg->length; > -- > 2.34.1