On Mon, Mar 11, 2024 at 9:28 PM wangyunjian <wangyunjian@xxxxxxxxxx> wrote:
>
> > -----Original Message-----
> > From: Jason Wang [mailto:jasowang@xxxxxxxxxx]
> > Sent: Monday, March 11, 2024 12:01 PM
> > To: wangyunjian <wangyunjian@xxxxxxxxxx>
> > Cc: Michael S. Tsirkin <mst@xxxxxxxxxx>; Paolo Abeni <pabeni@xxxxxxxxxx>;
> > willemdebruijn.kernel@xxxxxxxxx; kuba@xxxxxxxxxx; bjorn@xxxxxxxxxx;
> > magnus.karlsson@xxxxxxxxx; maciej.fijalkowski@xxxxxxxxx;
> > jonathan.lemon@xxxxxxxxx; davem@xxxxxxxxxxxxx; bpf@xxxxxxxxxxxxxxx;
> > netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx;
> > virtualization@xxxxxxxxxxxxxxx; xudingke <xudingke@xxxxxxxxxx>; liwei (DT)
> > <liwei395@xxxxxxxxxx>
> > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> >
> > On Mon, Mar 4, 2024 at 9:45 PM wangyunjian <wangyunjian@xxxxxxxxxx> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin [mailto:mst@xxxxxxxxxx]
> > > > Sent: Friday, March 1, 2024 7:53 PM
> > > > To: wangyunjian <wangyunjian@xxxxxxxxxx>
> > > > Cc: Paolo Abeni <pabeni@xxxxxxxxxx>; willemdebruijn.kernel@xxxxxxxxx;
> > > > jasowang@xxxxxxxxxx; kuba@xxxxxxxxxx; bjorn@xxxxxxxxxx;
> > > > magnus.karlsson@xxxxxxxxx; maciej.fijalkowski@xxxxxxxxx;
> > > > jonathan.lemon@xxxxxxxxx; davem@xxxxxxxxxxxxx; bpf@xxxxxxxxxxxxxxx;
> > > > netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx;
> > > > virtualization@xxxxxxxxxxxxxxx; xudingke <xudingke@xxxxxxxxxx>;
> > > > liwei (DT) <liwei395@xxxxxxxxxx>
> > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> > > >
> > > > On Fri, Mar 01, 2024 at 11:45:52AM +0000, wangyunjian wrote:
> > > > > > -----Original Message-----
> > > > > > From: Paolo Abeni [mailto:pabeni@xxxxxxxxxx]
> > > > > > Sent: Thursday, February 29, 2024 7:13 PM
> > > > > > To: wangyunjian <wangyunjian@xxxxxxxxxx>; mst@xxxxxxxxxx;
> > > > > > willemdebruijn.kernel@xxxxxxxxx; jasowang@xxxxxxxxxx; kuba@xxxxxxxxxx;
> > > > > > bjorn@xxxxxxxxxx; magnus.karlsson@xxxxxxxxx;
> > > > > > maciej.fijalkowski@xxxxxxxxx; jonathan.lemon@xxxxxxxxx;
> > > > > > davem@xxxxxxxxxxxxx
> > > > > > Cc: bpf@xxxxxxxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> > > > > > linux-kernel@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx;
> > > > > > virtualization@xxxxxxxxxxxxxxx; xudingke <xudingke@xxxxxxxxxx>;
> > > > > > liwei (DT) <liwei395@xxxxxxxxxx>
> > > > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> > > > > >
> > > > > > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > > > > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > > > > > >  	}
> > > > > > >  }
> > > > > > >
> > > > > > > +static void tun_peek_xsk(struct tun_file *tfile)
> > > > > > > +{
> > > > > > > +	struct xsk_buff_pool *pool;
> > > > > > > +	u32 i, batch, budget;
> > > > > > > +	void *frame;
> > > > > > > +
> > > > > > > +	if (!ptr_ring_empty(&tfile->tx_ring))
> > > > > > > +		return;
> > > > > > > +
> > > > > > > +	spin_lock(&tfile->pool_lock);
> > > > > > > +	pool = tfile->xsk_pool;
> > > > > > > +	if (!pool) {
> > > > > > > +		spin_unlock(&tfile->pool_lock);
> > > > > > > +		return;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	if (tfile->nb_descs) {
> > > > > > > +		xsk_tx_completed(pool, tfile->nb_descs);
> > > > > > > +		if (xsk_uses_need_wakeup(pool))
> > > > > > > +			xsk_set_tx_need_wakeup(pool);
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	spin_lock(&tfile->tx_ring.producer_lock);
> > > > > > > +	budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> > > > > > > +
> > > > > > > +	batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > > > > > +	if (!batch) {
> > > > > >
> > > > > > This branch looks like an unneeded "optimization". The generic loop
> > > > > > below should have the same effect with no measurable perf delta -
> > > > > > and smaller code. Just remove this.
> > > > > >
> > > > > > > +		tfile->nb_descs = 0;
> > > > > > > +		spin_unlock(&tfile->tx_ring.producer_lock);
> > > > > > > +		spin_unlock(&tfile->pool_lock);
> > > > > > > +		return;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	tfile->nb_descs = batch;
> > > > > > > +	for (i = 0; i < batch; i++) {
> > > > > > > +		/* Encode the XDP DESC flag into lowest bit for consumer
> > > > > > > +		 * to differ XDP desc from XDP buffer and sk_buff.
> > > > > > > +		 */
> > > > > > > +		frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> > > > > > > +		/* The budget must be less than or equal to tx_ring.size,
> > > > > > > +		 * so enqueuing will not fail.
> > > > > > > +		 */
> > > > > > > +		__ptr_ring_produce(&tfile->tx_ring, frame);
> > > > > > > +	}
> > > > > > > +	spin_unlock(&tfile->tx_ring.producer_lock);
> > > > > > > +	spin_unlock(&tfile->pool_lock);
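
For context, the "lowest bit" encoding in the hunk above is ordinary pointer
tagging: descriptor pointers are at least word-aligned, so bit 0 is free to
mark what kind of entry sits in the ring. A minimal sketch of what the
helpers could look like (the flag value and the tun_is_xdp_desc/
tun_ptr_to_xdp_desc names are assumptions for illustration, not necessarily
what the patch uses; in-tree tun already claims bit 0 for TUN_XDP_FLAG on
XDP frames, so the real encoding may pick a different bit):

	/* Sketch only: flag value assumed, may collide with TUN_XDP_FLAG */
	#define TUN_XDP_DESC_FLAG	0x1UL

	static inline void *tun_xdp_desc_to_ptr(struct xdp_desc *desc)
	{
		/* xdp_desc is word-aligned, so bit 0 is free for a tag */
		return (void *)((unsigned long)desc | TUN_XDP_DESC_FLAG);
	}

	static inline bool tun_is_xdp_desc(void *ptr)
	{
		return (unsigned long)ptr & TUN_XDP_DESC_FLAG;
	}

	static inline struct xdp_desc *tun_ptr_to_xdp_desc(void *ptr)
	{
		return (struct xdp_desc *)((unsigned long)ptr &
					   ~TUN_XDP_DESC_FLAG);
	}

The consumer side (tun_ptr_peek_len() and friends) would test the tag first
and strip it before dereferencing.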
> > > > > >
> > > > > > More related to the general design: it looks wrong. What if
> > > > > > get_rx_bufs() fails (ENOBUF) after a successful peek? With no more
> > > > > > incoming packets, a later peek will return 0 and it looks like the
> > > > > > half-processed packets will stay in the ring forever???
> > > > > >
> > > > > > I think the 'ring produce' part should be moved into tun_do_read().
> > > > >
> > > > > Currently, vhost-net obtains a batch of descriptors/sk_buffs from the
> > > > > ptr_ring, enqueues the batch to the virtqueue's queue, and then
> > > > > consumes the descriptors/sk_buffs from the virtqueue's queue in
> > > > > sequence. As a result, TUN does not know whether the batched
> > > > > descriptors have been used up, and thus does not know when to return
> > > > > them.
> > > > >
> > > > > So, I think it's reasonable that when vhost-net finds the ptr_ring
> > > > > empty, it calls peek_len to fetch new xsk descs and return the
> > > > > completed descriptors.
> > > > >
> > > > > Thanks
> > > >
> > > > What you need to think about is that if you peek, another call in
> > > > parallel can get the same value at the same time.
> > >
> > > Thank you. I have identified a problem. The tx_descs array was created
> > > within the xsk's pool. When the xsk is freed, the pool and tx_descs are
> > > also freed. However, some descs may remain in the virtqueue's queue,
> > > which could lead to a use-after-free scenario.
> >
> > This can probably be solved by signaling vhost_net to drop those
> > descriptors when the xsk pool is disabled.
>
> I think TUN can notify vhost_net to drop these descriptors through netdev
> events.

Great. Actually, the "issue" described above exists in this patch as well.
For example, you did:

	spin_lock(&tfile->pool_lock);
	if (tfile->pool) {
		ret = tun_put_user_desc(tun, tfile, &tfile->desc, to);

You did copy_to_user() under a spinlock, which is actually a bug.
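
The usual shape of the fix is to copy what you need out of the shared state
while the lock is held, and only do the faulting copy after dropping it. A
rough, untested sketch (this addresses only the locking rule; keeping the
pool itself alive across the copy would still need a separate mechanism,
e.g. a refcount):

	struct xdp_desc desc;
	bool have_desc = false;

	spin_lock(&tfile->pool_lock);
	if (tfile->pool) {
		desc = tfile->desc;	/* snapshot while the lock protects it */
		have_desc = true;
	}
	spin_unlock(&tfile->pool_lock);

	/* copy_to_user() can fault and sleep, so it must run unlocked */
	if (have_desc)
		ret = tun_put_user_desc(tun, tfile, &desc, to);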
> However, there is a potential concurrency problem. When handling netdev
> events and packets, vhost_net contends for the 'vq->mutex_lock', leading
> to unstable performance.

I think we don't need to care about perf in this case. And we gain a lot:

1) no trick in peek
2) batching support

...

Thanks

> > Thanks
>
> > > Thanks
> > >
> > > Currently, I do not have an idea to solve this concurrency problem and
> > > believe this scenario may not be appropriate for reusing the ptr_ring.
> > >
> > > Thanks
> > >
> > > > > > Cheers,
> > > > > >
> > > > > > Paolo
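
To make the netdev-events idea discussed above concrete, the vhost_net side
could register a netdevice notifier and flush any tagged descriptors before
TUN frees the pool's tx_descs array. A hypothetical sketch: the
NETDEV_XSK_POOL_DISABLE event, the netdev_nb field, and both vhost_net_*
helpers below are invented for illustration (no such event exists today;
TUN would have to raise one via call_netdevice_notifiers() when the pool is
torn down):

	static int vhost_net_netdev_event(struct notifier_block *nb,
					  unsigned long event, void *ptr)
	{
		struct net_device *dev = netdev_notifier_info_to_dev(ptr);
		struct vhost_net *n = container_of(nb, struct vhost_net,
						   netdev_nb);

		/* NETDEV_XSK_POOL_DISABLE is hypothetical, not an existing
		 * netdev event.
		 */
		if (event != NETDEV_XSK_POOL_DISABLE ||
		    !vhost_net_backs_dev(n, dev))
			return NOTIFY_DONE;

		/* Drop any tagged XDP descriptors still sitting in the
		 * virtqueue before TUN frees the pool's tx_descs array.
		 */
		vhost_net_flush_xdp_descs(n);
		return NOTIFY_OK;
	}

As noted in the thread, the handler and the datapath would both need
vq->mutex, which is the contention Yunjian is worried about.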