Re: [PATCH net-next 4/5] tsnep: Add XDP socket zero-copy RX support

On 03.04.23 19:26, Maciej Fijalkowski wrote:
On Mon, Apr 03, 2023 at 07:19:15PM +0200, Maciej Fijalkowski wrote:
On Sun, Apr 02, 2023 at 09:38:37PM +0200, Gerhard Engleder wrote:

Hey Gerhard,

Add support for XSK zero-copy to RX path. The setup of the XSK pool can
be done at runtime. If the netdev is running, then the queue must be
disabled and enabled during reconfiguration. This can be done easily
with functions introduced in previous commits.

A more important property is that, if the netdev is running, then the
setup of the XSK pool shall not stop the netdev in case of errors. A
broken netdev after a failed XSK pool setup is bad behavior. Therefore,
the allocation and setup of resources during XSK pool setup is done only
before any queue is disabled. Additionally, freeing and later allocation
of resources is eliminated in some cases. Page pool entries are kept for
later use. Two memory models are registered in parallel. As a result,
the XSK pool setup cannot fail during queue reconfiguration.

In contrast to other drivers, XSK pool setup and XDP BPF program setup
are separate actions. XSK pool setup can be done without any XDP BPF
program. The XDP BPF program can be added, removed or changed without
any reconfiguration of the XSK pool.
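For readers following the "two memory models registered in parallel" idea above, a rough sketch of what such registration could look like (the xdp_rxq/xdp_rxq_zc field names and error-handling shape are assumptions for illustration, not necessarily what the patch does):

```c
/* Sketch only: register a page_pool backed rxq for copy mode and an
 * XSK pool backed rxq for zero-copy in parallel, so enabling or
 * disabling the XSK pool at runtime never has to unregister a memory
 * model and therefore cannot fail at that point.
 */
retval = xdp_rxq_info_reg(&rx->xdp_rxq, adapter->netdev, queue_index,
			  napi->napi_id);
if (retval)
	return retval;
retval = xdp_rxq_info_reg_mem_model(&rx->xdp_rxq, MEM_TYPE_PAGE_POOL,
				    rx->page_pool);
if (retval)
	return retval;

retval = xdp_rxq_info_reg(&rx->xdp_rxq_zc, adapter->netdev, queue_index,
			  napi->napi_id);
if (retval)
	return retval;
/* allocator is NULL for MEM_TYPE_XSK_BUFF_POOL */
retval = xdp_rxq_info_reg_mem_model(&rx->xdp_rxq_zc,
				    MEM_TYPE_XSK_BUFF_POOL, NULL);
```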

I won't argue with your design, but I'd be glad if you could present some
perf numbers (ZC vs copy mode) just to give us an overview of how your
implementation works out. Also, please consider using the batching APIs and
see if this gives you any boost (my assumption is that it would).
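The batching API in question is presumably xsk_buff_alloc_batch() from include/net/xdp_sock_drv.h. A hedged sketch of how a ZC refill path could use it (the xdp_batch scratch array and the tsnep_rx_entry fields used here are assumptions):

```c
/* Sketch: allocate up to 'count' XSK buffers in one call instead of
 * calling xsk_buff_alloc() once per descriptor. Assumes rx->xdp_batch
 * is a scratch array of at least 'count' xdp_buff pointers.
 */
static int tsnep_rx_alloc_zc_batch(struct tsnep_rx *rx, int count)
{
	int allocated, i;

	allocated = xsk_buff_alloc_batch(rx->xsk_pool, rx->xdp_batch, count);
	for (i = 0; i < allocated; i++) {
		int index = (rx->write + i) % TSNEP_RING_SIZE;
		struct tsnep_rx_entry *entry = &rx->entry[index];

		entry->xdp = rx->xdp_batch[i];
		/* the DMA address comes straight from the pool */
		entry->dma = xsk_buff_xdp_get_dma(entry->xdp);
	}

	return allocated;
}
```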


Signed-off-by: Gerhard Engleder <gerhard@xxxxxxxxxxxxxxxxxxxxx>
---
  drivers/net/ethernet/engleder/tsnep.h      |   7 +
  drivers/net/ethernet/engleder/tsnep_main.c | 432 ++++++++++++++++++++-
  drivers/net/ethernet/engleder/tsnep_xdp.c  |  67 ++++
  3 files changed, 488 insertions(+), 18 deletions(-)

(...)

+
  static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
  			       struct xdp_buff *xdp, int *status,
-			       struct netdev_queue *tx_nq, struct tsnep_tx *tx)
+			       struct netdev_queue *tx_nq, struct tsnep_tx *tx,
+			       bool zc)
  {
  	unsigned int length;
-	unsigned int sync;
  	u32 act;
  	length = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
 
  	act = bpf_prog_run_xdp(prog, xdp);
-
-	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
-	sync = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
-	sync = max(sync, length);
-
  	switch (act) {
  	case XDP_PASS:
  		return false;
@@ -1027,8 +1149,21 @@ static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
  		trace_xdp_exception(rx->adapter->netdev, prog, act);
  		fallthrough;
  	case XDP_DROP:
-		page_pool_put_page(rx->page_pool, virt_to_head_page(xdp->data),
-				   sync, true);
+		if (zc) {
+			xsk_buff_free(xdp);
+		} else {
+			unsigned int sync;
+
+			/* Due xdp_adjust_tail: DMA sync for_device cover max
+			 * len CPU touch
+			 */
+			sync = xdp->data_end - xdp->data_hard_start -
+			       XDP_PACKET_HEADROOM;
+			sync = max(sync, length);
+			page_pool_put_page(rx->page_pool,
+					   virt_to_head_page(xdp->data), sync,
+					   true);
+		}
  		return true;
  	}
  }
@@ -1181,7 +1316,8 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
  					 length, false);
  			consume = tsnep_xdp_run_prog(rx, prog, &xdp,
-						     &xdp_status, tx_nq, tx);
+						     &xdp_status, tx_nq, tx,
+						     false);
  			if (consume) {
  				rx->packets++;
  				rx->bytes += length;
@@ -1205,6 +1341,125 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
  	return done;
  }
+static int tsnep_rx_poll_zc(struct tsnep_rx *rx, struct napi_struct *napi,
+			    int budget)
+{
+	struct tsnep_rx_entry *entry;
+	struct netdev_queue *tx_nq;
+	struct bpf_prog *prog;
+	struct tsnep_tx *tx;
+	int desc_available;
+	int xdp_status = 0;
+	struct page *page;
+	int done = 0;
+	int length;
+
+	desc_available = tsnep_rx_desc_available(rx);
+	prog = READ_ONCE(rx->adapter->xdp_prog);
+	if (prog) {
+		tx_nq = netdev_get_tx_queue(rx->adapter->netdev,
+					    rx->tx_queue_index);
+		tx = &rx->adapter->tx[rx->tx_queue_index];
+	}
+
+	while (likely(done < budget) && (rx->read != rx->write)) {
+		entry = &rx->entry[rx->read];
+		if ((__le32_to_cpu(entry->desc_wb->properties) &
+		     TSNEP_DESC_OWNER_COUNTER_MASK) !=
+		    (entry->properties & TSNEP_DESC_OWNER_COUNTER_MASK))
+			break;
+		done++;
+
+		if (desc_available >= TSNEP_RING_RX_REFILL) {
+			bool reuse = desc_available >= TSNEP_RING_RX_REUSE;
+
+			desc_available -= tsnep_rx_refill_zc(rx, desc_available,
+							     reuse);
+			if (!entry->xdp) {
+				/* buffer has been reused for refill to prevent
+				 * empty RX ring, thus buffer cannot be used for
+				 * RX processing
+				 */
+				rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
+				desc_available++;
+
+				rx->dropped++;
+
+				continue;
+			}
+		}
+
+		/* descriptor properties shall be read first, because valid data
+		 * is signaled there
+		 */
+		dma_rmb();
+
+		prefetch(entry->xdp->data);
+		length = __le32_to_cpu(entry->desc_wb->properties) &
+			 TSNEP_DESC_LENGTH_MASK;
+		entry->xdp->data_end = entry->xdp->data + length;
+		xsk_buff_dma_sync_for_cpu(entry->xdp, rx->xsk_pool);
+
+		/* RX metadata with timestamps is in front of actual data,
+		 * subtract metadata size to get length of actual data and
+		 * consider metadata size as offset of actual data during RX
+		 * processing
+		 */
+		length -= TSNEP_RX_INLINE_METADATA_SIZE;
+
+		rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
+		desc_available++;
+
+		if (prog) {
+			bool consume;
+
+			entry->xdp->data += TSNEP_RX_INLINE_METADATA_SIZE;
+			entry->xdp->data_meta += TSNEP_RX_INLINE_METADATA_SIZE;
+
+			consume = tsnep_xdp_run_prog(rx, prog, entry->xdp,
+						     &xdp_status, tx_nq, tx,
+						     true);

The reason for a separate xdp run prog routine for ZC was usually
"likely-fying" the XDP_REDIRECT action, as this is the main action for
AF_XDP and it was giving us a perf improvement. Please try this out on
your side to see if it yields any positive value.
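For context, the "likely-fying" pattern used by drivers such as ice looks roughly like the sketch below; the tsnep identifiers (rx->adapter->netdev, TSNEP_XDP_REDIRECT, out_failure) are assumptions based on this patch, not tested code:

```c
	act = bpf_prog_run_xdp(prog, xdp);

	/* XDP_REDIRECT into the XSK rings is the dominant verdict for
	 * AF_XDP, so handle it before entering the switch.
	 */
	if (likely(act == XDP_REDIRECT)) {
		if (xdp_do_redirect(rx->adapter->netdev, xdp, prog) < 0)
			goto out_failure;
		*status |= TSNEP_XDP_REDIRECT;
		return true;
	}

	switch (act) {
	case XDP_PASS:
		return false;
	case XDP_ABORTED:
out_failure:
		trace_xdp_exception(rx->adapter->netdev, prog, act);
		fallthrough;
	case XDP_DROP:
		xsk_buff_free(xdp);
		return true;
	}
```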

One more thing - you have to handle the XDP_TX action in a ZC-specific way.
Your current code will break if you enable an xsk_pool and return XDP_TX
from the XDP prog.

I took another look at igc, but I didn't find anything specific to XDP_TX
ZC. Only some buffer flipping, which I assume is needed for shared
pages.
For ice I see a call to xdp_convert_buff_to_frame() in the ZC path, which
has some XSK logic within. Is this the ZC-specific way? igc calls
xdp_convert_buff_to_frame() in both cases, so I'm not sure. But I will
try the XDP_TX action. I tested only with xdpsock.
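As far as I can tell, xdp_convert_buff_to_frame() does carry the ZC-specific logic: for MEM_TYPE_XSK_BUFF_POOL buffers it copies the frame out of the UMEM into a newly allocated page and frees the XSK buffer internally, so the TX ring never holds a UMEM buffer. A hedged sketch of a ZC-safe XDP_TX branch (the xmit helper and TSNEP_XDP_TX names are assumptions):

```c
	case XDP_TX:
		/* In ZC mode the xdp_buff lives in UMEM and must not stay in
		 * the TX ring. xdp_convert_buff_to_frame() detects
		 * MEM_TYPE_XSK_BUFF_POOL, copies the data into a freshly
		 * allocated page and frees the XSK buffer internally, so the
		 * resulting xdp_frame is safe to queue for transmission.
		 */
		xdpf = xdp_convert_buff_to_frame(xdp);
		if (unlikely(!xdpf))
			goto out_failure;
		if (!tsnep_xdp_xmit_frame_ring(xdpf, tx, TSNEP_TX_TYPE_XDP_TX))
			goto out_failure;
		*status |= TSNEP_XDP_TX;
		return true;
```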
