RE: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@xxxxxxxxxx]
> Sent: Friday, March 1, 2024 7:53 PM
> To: wangyunjian <wangyunjian@xxxxxxxxxx>
> Cc: Paolo Abeni <pabeni@xxxxxxxxxx>; willemdebruijn.kernel@xxxxxxxxx;
> jasowang@xxxxxxxxxx; kuba@xxxxxxxxxx; bjorn@xxxxxxxxxx;
> magnus.karlsson@xxxxxxxxx; maciej.fijalkowski@xxxxxxxxx;
> jonathan.lemon@xxxxxxxxx; davem@xxxxxxxxxxxxx; bpf@xxxxxxxxxxxxxxx;
> netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx;
> virtualization@xxxxxxxxxxxxxxx; xudingke <xudingke@xxxxxxxxxx>; liwei (DT)
> <liwei395@xxxxxxxxxx>
> Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> 
> On Fri, Mar 01, 2024 at 11:45:52AM +0000, wangyunjian wrote:
> > > -----Original Message-----
> > > From: Paolo Abeni [mailto:pabeni@xxxxxxxxxx]
> > > Sent: Thursday, February 29, 2024 7:13 PM
> > > To: wangyunjian <wangyunjian@xxxxxxxxxx>; mst@xxxxxxxxxx;
> > > willemdebruijn.kernel@xxxxxxxxx; jasowang@xxxxxxxxxx;
> > > kuba@xxxxxxxxxx; bjorn@xxxxxxxxxx; magnus.karlsson@xxxxxxxxx;
> > > maciej.fijalkowski@xxxxxxxxx; jonathan.lemon@xxxxxxxxx;
> > > davem@xxxxxxxxxxxxx
> > > Cc: bpf@xxxxxxxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> > > linux-kernel@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx;
> > > virtualization@xxxxxxxxxxxxxxx; xudingke <xudingke@xxxxxxxxxx>;
> > > liwei (DT) <liwei395@xxxxxxxxxx>
> > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > support
> > >
> > > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > > >  	}
> > > >  }
> > > >
> > > > +static void tun_peek_xsk(struct tun_file *tfile) {
> > > > +	struct xsk_buff_pool *pool;
> > > > +	u32 i, batch, budget;
> > > > +	void *frame;
> > > > +
> > > > +	if (!ptr_ring_empty(&tfile->tx_ring))
> > > > +		return;
> > > > +
> > > > +	spin_lock(&tfile->pool_lock);
> > > > +	pool = tfile->xsk_pool;
> > > > +	if (!pool) {
> > > > +		spin_unlock(&tfile->pool_lock);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	if (tfile->nb_descs) {
> > > > +		xsk_tx_completed(pool, tfile->nb_descs);
> > > > +		if (xsk_uses_need_wakeup(pool))
> > > > +			xsk_set_tx_need_wakeup(pool);
> > > > +	}
> > > > +
> > > > +	spin_lock(&tfile->tx_ring.producer_lock);
> > > > +	budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> > > > +
> > > > +	batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > > +	if (!batch) {
> > >
> > > This branch looks like an unneeded "optimization". The generic loop
> > > > below should have the same effect with no measurable perf delta - and
> > > > smaller code.
> > > Just remove this.
> > >
> > > > +		tfile->nb_descs = 0;
> > > > +		spin_unlock(&tfile->tx_ring.producer_lock);
> > > > +		spin_unlock(&tfile->pool_lock);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	tfile->nb_descs = batch;
> > > > +	for (i = 0; i < batch; i++) {
> > > > +		/* Encode the XDP DESC flag into lowest bit for consumer to
> differ
> > > > +		 * XDP desc from XDP buffer and sk_buff.
> > > > +		 */
> > > > +		frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> > > > +		/* The budget must be less than or equal to tx_ring.size,
> > > > +		 * so enqueuing will not fail.
> > > > +		 */
> > > > +		__ptr_ring_produce(&tfile->tx_ring, frame);
> > > > +	}
> > > > +	spin_unlock(&tfile->tx_ring.producer_lock);
> > > > +	spin_unlock(&tfile->pool_lock);
> > >
> > > More related to the general design: it looks wrong. What if
> > > get_rx_bufs() will fail (ENOBUF) after successful peeking? With no
> > > more incoming packets, later peek will return 0 and it looks like
> > > that the half-processed packets will stay in the ring forever???
> > >
> > > I think the 'ring produce' part should be moved into tun_do_read().
> >
> > Currently, the vhost-net obtains a batch descriptors/sk_buffs from the
> > ptr_ring and enqueue the batch descriptors/sk_buffs to the
> > virtqueue'queue, and then consumes the descriptors/sk_buffs from the
> > virtqueue'queue in sequence. As a result, TUN does not know whether
> > the batch descriptors have been used up, and thus does not know when to
> > return the batch descriptors.
> >
> > So, I think it's reasonable that when vhost-net checks ptr_ring is
> > empty, it calls peek_len to get new xsk's descs and return the descriptors.
> >
> > Thanks
> 
> What you need to think about is that if you peek, another call in parallel can get
> the same value at the same time.
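
To illustrate the point about peeking, here is a toy userspace sketch. It
is not ptr_ring and not the tun code: demo_ring, demo_peek, demo_consume
and demo_double_peek are hypothetical names, and the two "callers" are
interleaved by hand in a single thread so the outcome is deterministic.
It only shows that a peek which is not paired with the consume under one
lock lets two callers observe the same entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy ring; NOT the kernel's struct ptr_ring. */
struct demo_ring {
	int slots[4];
	size_t head;	/* next entry to consume */
	size_t tail;	/* next free slot */
};

/* Look at the head entry without removing it. */
static const int *demo_peek(const struct demo_ring *r)
{
	return r->head == r->tail ? NULL : &r->slots[r->head % 4];
}

/* Remove the head entry, if any. */
static void demo_consume(struct demo_ring *r)
{
	if (r->head != r->tail)
		r->head++;
}

/* Interleave two callers by hand: both peek before either consumes.
 * Returns 1 if both callers saw the same entry.
 */
static int demo_double_peek(void)
{
	struct demo_ring r = { { 10, 20, 0, 0 }, 0, 2 };
	const int *a = demo_peek(&r);	/* caller A peeks */
	const int *b = demo_peek(&r);	/* caller B peeks before A consumes */
	int same = a && b && *a == *b;

	demo_consume(&r);	/* A removes the entry it peeked (10) */
	demo_consume(&r);	/* B removes 20, which nobody looked at */
	assert(demo_peek(&r) == NULL);

	return same;
}
```

In this interleaving both callers act on entry 10, and entry 20 is dropped
without ever being processed.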

Thank you. I have identified a problem. The tx_descs array is allocated as part of the
xsk's pool, so when the xsk is freed, the pool and tx_descs are freed with it. However,
some descs may still remain in the virtqueue's queue at that point, which could lead to
a use-after-free. I do not currently have an idea for solving this concurrency problem,
and I believe this scenario may not be suitable for reusing the ptr_ring.
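
To make the lifetime issue concrete, here is a minimal userspace sketch.
Everything in it is a hypothetical stand-in (demo_desc, DEMO_DESC_FLAG,
demo_desc_to_tagged and so on); the real tun_xdp_desc_to_ptr() in the
patch may be defined differently. It assumes only what the patch comment
states, namely that the flag lives in the lowest bit of the pointer. The
queue stores the tagged values as integers so the sketch never touches
freed memory; it just counts the entries that would dangle.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_DESC_FLAG	((uintptr_t)1)	/* hypothetical low-bit tag */
#define DEMO_BATCH	4

/* Hypothetical descriptor; 8-byte aligned, so bit 0 of its address is free. */
struct demo_desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

static uintptr_t demo_desc_to_tagged(const struct demo_desc *d)
{
	return (uintptr_t)d | DEMO_DESC_FLAG;
}

static int demo_is_desc(uintptr_t t)
{
	return (t & DEMO_DESC_FLAG) != 0;
}

static const struct demo_desc *demo_tagged_to_desc(uintptr_t t)
{
	return (const struct demo_desc *)(t & ~DEMO_DESC_FLAG);
}

/* Queue a batch of tagged descriptor addresses, consume half of them while
 * the array is alive, then free the array. Returns the number of queued
 * entries that still refer to the freed array, or -1 on allocation failure.
 */
static int demo_stale_after_pool_free(void)
{
	struct demo_desc *tx_descs = calloc(DEMO_BATCH, sizeof(*tx_descs));
	uintptr_t queue[DEMO_BATCH];
	uintptr_t base, end;
	int i, stale = 0;

	if (!tx_descs)
		return -1;

	for (i = 0; i < DEMO_BATCH; i++) {
		tx_descs[i].len = (uint32_t)(64 + i);
		queue[i] = demo_desc_to_tagged(&tx_descs[i]);
	}

	/* Consumer handles the first half while the array is still valid. */
	for (i = 0; i < DEMO_BATCH / 2; i++) {
		assert(demo_is_desc(queue[i]));
		assert(demo_tagged_to_desc(queue[i])->len == (uint32_t)(64 + i));
		queue[i] = 0;
	}

	/* Record the address range before teardown, then free the array. */
	base = (uintptr_t)tx_descs;
	end = base + DEMO_BATCH * sizeof(*tx_descs);
	free(tx_descs);

	/* Count entries still pointing into the freed range. They are only
	 * compared as integers here; dereferencing them would be the
	 * use-after-free.
	 */
	for (i = 0; i < DEMO_BATCH; i++) {
		uintptr_t a = queue[i] & ~DEMO_DESC_FLAG;

		if (queue[i] && a >= base && a < end)
			stale++;
	}
	return stale;
}
```

With a batch of four and two entries consumed before teardown, two tagged
addresses are left in the queue with nothing keeping the array alive.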

Thanks

> 
> 
> > >
> > > Cheers,
> > >
> > > Paolo
> >





