Re: [PATCH net-next 4/5] tsnep: Add XDP socket zero-copy RX support

On 07.04.23 15:41, Maciej Fijalkowski wrote:
On Wed, Apr 05, 2023 at 09:13:58PM +0200, Gerhard Engleder wrote:
On 03.04.23 19:26, Maciej Fijalkowski wrote:
On Mon, Apr 03, 2023 at 07:19:15PM +0200, Maciej Fijalkowski wrote:
On Sun, Apr 02, 2023 at 09:38:37PM +0200, Gerhard Engleder wrote:

Hey Gerhard,

Add support for XSK zero-copy to RX path. The setup of the XSK pool can
be done at runtime. If the netdev is running, then the queue must be
disabled and enabled during reconfiguration. This can be done easily
with functions introduced in previous commits.

A more important property is that, if the netdev is running, a failed
XSK pool setup must not stop the netdev. A broken netdev after a failed
XSK pool setup is bad behavior. Therefore, the allocation and setup of
resources during XSK pool setup is done only before any queue is
disabled. Additionally, freeing and later re-allocation of resources is
eliminated in some cases: page pool entries are kept for later use, and
two memory models are registered in parallel. As a result, the XSK pool
setup cannot fail during queue reconfiguration.
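For illustration, registering both memory models up front can look
roughly like the sketch below. This is not the actual driver code; the
xdp_rxq/xdp_rxq_zc field names and the helper name are assumptions.

/* Sketch only: register a page pool backed memory model and an XSK
 * backed one in parallel, so that switching between them at runtime
 * cannot fail.
 */
static int tsnep_rx_reg_mem_models(struct tsnep_rx *rx, u32 queue_index,
				   unsigned int napi_id)
{
	int ret;

	ret = xdp_rxq_info_reg(&rx->xdp_rxq, rx->adapter->netdev,
			       queue_index, napi_id);
	if (ret)
		return ret;
	ret = xdp_rxq_info_reg_mem_model(&rx->xdp_rxq, MEM_TYPE_PAGE_POOL,
					 rx->page_pool);
	if (ret)
		return ret;

	ret = xdp_rxq_info_reg(&rx->xdp_rxq_zc, rx->adapter->netdev,
			       queue_index, napi_id);
	if (ret)
		return ret;
	/* no allocator argument; the XSK pool is attached separately via
	 * xsk_pool_set_rxq_info() during XSK pool setup
	 */
	return xdp_rxq_info_reg_mem_model(&rx->xdp_rxq_zc,
					  MEM_TYPE_XSK_BUFF_POOL, NULL);
}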

In contrast to other drivers, XSK pool setup and XDP BPF program setup
are separate actions. XSK pool setup can be done without any XDP BPF
program. The XDP BPF program can be added, removed or changed without
any reconfiguration of the XSK pool.
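In code, this separation can be expressed with an ndo_bpf dispatch
along the following lines (a sketch; the tsnep_xdp_setup_prog() and
tsnep_xdp_setup_pool() helper names are assumptions):

static int tsnep_netdev_bpf(struct net_device *dev, struct netdev_bpf *bpf)
{
	struct tsnep_adapter *adapter = netdev_priv(dev);

	switch (bpf->command) {
	case XDP_SETUP_PROG:
		/* swap the XDP BPF program without touching the XSK pool */
		return tsnep_xdp_setup_prog(adapter, bpf->prog, bpf->extack);
	case XDP_SETUP_XSK_POOL:
		/* set up/tear down the XSK pool independent of any program */
		return tsnep_xdp_setup_pool(adapter, bpf->xsk.pool,
					    bpf->xsk.queue_id);
	default:
		return -EOPNOTSUPP;
	}
}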

I won't argue about your design, but I'd be glad if you would present
some perf numbers (ZC vs copy mode), just to give us an overview of how
your implementation works out. Also, please consider using the batching
APIs and see if they give you any boost (my assumption is that they
would).
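For example, a batched ZC refill could be built around
xsk_buff_alloc_batch() roughly as follows. This is only a sketch
against the ring fields visible in this patch; descriptor writing is
elided and the helper name is made up.

/* Sketch only: refill a batch of RX descriptors with one pool call
 * instead of allocating xdp_buffs one by one.
 */
static int tsnep_rx_refill_batch_zc(struct tsnep_rx *rx, int count)
{
	struct xdp_buff *bufs[TSNEP_RING_RX_REFILL];
	u32 allocated, i;

	count = min_t(int, count, TSNEP_RING_RX_REFILL);
	allocated = xsk_buff_alloc_batch(rx->xsk_pool, bufs, count);
	for (i = 0; i < allocated; i++) {
		int index = (rx->write + i) % TSNEP_RING_SIZE;

		rx->entry[index].xdp = bufs[i];
		/* write xsk_buff_xdp_get_dma(bufs[i]) and the buffer
		 * length to the hardware descriptor here
		 */
	}
	rx->write = (rx->write + allocated) % TSNEP_RING_SIZE;

	return allocated;
}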


Signed-off-by: Gerhard Engleder <gerhard@xxxxxxxxxxxxxxxxxxxxx>
---
   drivers/net/ethernet/engleder/tsnep.h      |   7 +
   drivers/net/ethernet/engleder/tsnep_main.c | 432 ++++++++++++++++++++-
   drivers/net/ethernet/engleder/tsnep_xdp.c  |  67 ++++
   3 files changed, 488 insertions(+), 18 deletions(-)

(...)

+
   static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
   			       struct xdp_buff *xdp, int *status,
-			       struct netdev_queue *tx_nq, struct tsnep_tx *tx)
+			       struct netdev_queue *tx_nq, struct tsnep_tx *tx,
+			       bool zc)
   {
   	unsigned int length;
-	unsigned int sync;
   	u32 act;
   	length = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
   	act = bpf_prog_run_xdp(prog, xdp);
-
-	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
-	sync = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
-	sync = max(sync, length);
-
   	switch (act) {
   	case XDP_PASS:
   		return false;
@@ -1027,8 +1149,21 @@ static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
   		trace_xdp_exception(rx->adapter->netdev, prog, act);
   		fallthrough;
   	case XDP_DROP:
-		page_pool_put_page(rx->page_pool, virt_to_head_page(xdp->data),
-				   sync, true);
+		if (zc) {
+			xsk_buff_free(xdp);
+		} else {
+			unsigned int sync;
+
+			/* Due to xdp_adjust_tail: DMA sync for_device must
+			 * cover max len touched by CPU
+			 */
+			sync = xdp->data_end - xdp->data_hard_start -
+			       XDP_PACKET_HEADROOM;
+			sync = max(sync, length);
+			page_pool_put_page(rx->page_pool,
+					   virt_to_head_page(xdp->data), sync,
+					   true);
+		}
   		return true;
   	}
   }
@@ -1181,7 +1316,8 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
   					 length, false);
   			consume = tsnep_xdp_run_prog(rx, prog, &xdp,
-						     &xdp_status, tx_nq, tx);
+						     &xdp_status, tx_nq, tx,
+						     false);
   			if (consume) {
   				rx->packets++;
   				rx->bytes += length;
@@ -1205,6 +1341,125 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
   	return done;
   }
+static int tsnep_rx_poll_zc(struct tsnep_rx *rx, struct napi_struct *napi,
+			    int budget)
+{
+	struct tsnep_rx_entry *entry;
+	struct netdev_queue *tx_nq;
+	struct bpf_prog *prog;
+	struct tsnep_tx *tx;
+	int desc_available;
+	int xdp_status = 0;
+	struct page *page;
+	int done = 0;
+	int length;
+
+	desc_available = tsnep_rx_desc_available(rx);
+	prog = READ_ONCE(rx->adapter->xdp_prog);
+	if (prog) {
+		tx_nq = netdev_get_tx_queue(rx->adapter->netdev,
+					    rx->tx_queue_index);
+		tx = &rx->adapter->tx[rx->tx_queue_index];
+	}
+
+	while (likely(done < budget) && (rx->read != rx->write)) {
+		entry = &rx->entry[rx->read];
+		if ((__le32_to_cpu(entry->desc_wb->properties) &
+		     TSNEP_DESC_OWNER_COUNTER_MASK) !=
+		    (entry->properties & TSNEP_DESC_OWNER_COUNTER_MASK))
+			break;
+		done++;
+
+		if (desc_available >= TSNEP_RING_RX_REFILL) {
+			bool reuse = desc_available >= TSNEP_RING_RX_REUSE;
+
+			desc_available -= tsnep_rx_refill_zc(rx, desc_available,
+							     reuse);
+			if (!entry->xdp) {
+				/* buffer has been reused for refill to prevent
+				 * empty RX ring, thus buffer cannot be used for
+				 * RX processing
+				 */
+				rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
+				desc_available++;
+
+				rx->dropped++;
+
+				continue;
+			}
+		}
+
+		/* descriptor properties shall be read first, because valid data
+		 * is signaled there
+		 */
+		dma_rmb();
+
+		prefetch(entry->xdp->data);
+		length = __le32_to_cpu(entry->desc_wb->properties) &
+			 TSNEP_DESC_LENGTH_MASK;
+		entry->xdp->data_end = entry->xdp->data + length;
+		xsk_buff_dma_sync_for_cpu(entry->xdp, rx->xsk_pool);
+
+		/* RX metadata with timestamps is in front of actual data,
+		 * subtract metadata size to get length of actual data and
+		 * consider metadata size as offset of actual data during RX
+		 * processing
+		 */
+		length -= TSNEP_RX_INLINE_METADATA_SIZE;
+
+		rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
+		desc_available++;
+
+		if (prog) {
+			bool consume;
+
+			entry->xdp->data += TSNEP_RX_INLINE_METADATA_SIZE;
+			entry->xdp->data_meta += TSNEP_RX_INLINE_METADATA_SIZE;
+
+			consume = tsnep_xdp_run_prog(rx, prog, entry->xdp,
+						     &xdp_status, tx_nq, tx,
+						     true);

The reason for a separate XDP run prog routine for ZC was usually
"likely-fying" the XDP_REDIRECT action, as this is the main action for
AF_XDP and it gave us a perf improvement. Please try this out on your
side to see if it yields any positive value.
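Something along these lines (a sketch, not the actual tsnep code; the
TSNEP_XDP_REDIRECT status flag is assumed to match the copy-mode path,
and XDP_TX is omitted because it is discussed below):

static bool tsnep_xdp_run_prog_zc(struct tsnep_rx *rx, struct bpf_prog *prog,
				  struct xdp_buff *xdp, int *status)
{
	u32 act = bpf_prog_run_xdp(prog, xdp);

	/* XDP_REDIRECT is the main action for AF_XDP */
	if (likely(act == XDP_REDIRECT)) {
		if (xdp_do_redirect(rx->adapter->netdev, xdp, prog) < 0)
			goto out_failure;
		*status |= TSNEP_XDP_REDIRECT;
		return true;
	}

	switch (act) {
	case XDP_PASS:
		return false;
	default:
		bpf_warn_invalid_xdp_action(rx->adapter->netdev, prog, act);
		fallthrough;
	case XDP_ABORTED:
out_failure:
		trace_xdp_exception(rx->adapter->netdev, prog, act);
		fallthrough;
	case XDP_DROP:
		xsk_buff_free(xdp);
		return true;
	}
}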

One more thing - you have to handle the XDP_TX action in a ZC-specific
way. Your current code will break if you enable an xsk_pool and return
XDP_TX from the XDP prog.

I took another look at igc, but I didn't find any specifics for XDP_TX
ZC. Only some buffer flipping, which I assume is needed for shared
pages.
For ice I see a call to xdp_convert_buff_to_frame() in the ZC path,
which has some XSK logic within. Is this the ZC-specific way? igc calls
xdp_convert_buff_to_frame() in both cases, so I'm not sure. But I will
try the XDP_TX action. I have only tested with xdpsock so far.

I think I will back off a bit from the statement that your XDP_TX is
clearly broken; here's why.

When igc converts an xdp_buff to an xdp_frame and the xdp_buff's memory
is backed by an xsk_buff_pool, it will grab a new page from the kernel,
copy the contents of the xdp_buff to it, recycle the xdp_buff back to
the xsk_buff_pool and return the new page to the driver (I have just
described what xdp_convert_zc_to_xdp_frame() does). The thing is that
this is expensive and hurts perf, so we stepped away from it on ice.
There it is a matter of storing the xdp_buff in an adequate Tx buffer
struct that you can access while cleaning Tx descriptors, so that you
will be able to xsk_buff_free() it.
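In rough code terms (a sketch; the Tx entry layout and the
TSNEP_TX_TYPE_* flags are assumptions for illustration):

/* Sketch: keep the XSK xdp_buff in the Tx buffer struct instead of
 * copying it into an xdp_frame, and return it to the pool on Tx clean.
 */
struct tsnep_tx_entry {
	/* other fields omitted */
	union {
		struct xdp_frame *xdpf;	/* copy-mode XDP_TX */
		struct xdp_buff *xdp;	/* zero-copy XDP_TX */
	};
	u32 type;
};

static void tsnep_tx_entry_free(struct tsnep_tx_entry *entry)
{
	if (entry->type & TSNEP_TX_TYPE_XSK)
		xsk_buff_free(entry->xdp);	/* back to the xsk_buff_pool */
	else if (entry->type & TSNEP_TX_TYPE_XDP)
		xdp_return_frame(entry->xdpf);
}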

So saying 'your current code will break' might have been too much from
my side. Just make sure that your XDP_TX action on ZC works. To verify
that, I was loading xdpsock with the XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD
flag on a queue receiving frames and then running xdp_rxq_info with the
XDP_TX action.

I took a look at the new ice code. It makes sense to avoid that
copying. I took a note for future work. For now, XDP_REDIRECT is the
optimised path.

Tested as suggested. XDP_TX works.


