On 07.04.23 15:41, Maciej Fijalkowski wrote:
On Wed, Apr 05, 2023 at 09:13:58PM +0200, Gerhard Engleder wrote:On 03.04.23 19:26, Maciej Fijalkowski wrote:On Mon, Apr 03, 2023 at 07:19:15PM +0200, Maciej Fijalkowski wrote:On Sun, Apr 02, 2023 at 09:38:37PM +0200, Gerhard Engleder wrote: Hey Gerhard,Add support for XSK zero-copy to RX path. The setup of the XSK pool can be done at runtime. If the netdev is running, then the queue must be disabled and enabled during reconfiguration. This can be done easily with functions introduced in previous commits. A more important property is that, if the netdev is running, then the setup of the XSK pool shall not stop the netdev in case of errors. A broken netdev after a failed XSK pool setup is bad behavior. Therefore, the allocation and setup of resources during XSK pool setup is done only before any queue is disabled. Additionally, freeing and later allocation of resources is eliminated in some cases. Page pool entries are kept for later use. Two memory models are registered in parallel. As a result, the XSK pool setup cannot fail during queue reconfiguration. In contrast to other drivers, XSK pool setup and XDP BPF program setup are separate actions. XSK pool setup can be done without any XDP BPF program. The XDP BPF program can be added, removed or changed without any reconfiguration of the XSK pool.I won't argue about your design, but I'd be glad if you would present any perf numbers (ZC vs copy mode) just to give us some overview how your implementation works out. Also, please consider using batching APIs and see if this gives you any boost (my assumption is that it would).Signed-off-by: Gerhard Engleder <gerhard@xxxxxxxxxxxxxxxxxxxxx> --- drivers/net/ethernet/engleder/tsnep.h | 7 + drivers/net/ethernet/engleder/tsnep_main.c | 432 ++++++++++++++++++++- drivers/net/ethernet/engleder/tsnep_xdp.c | 67 ++++ 3 files changed, 488 insertions(+), 18 deletions(-)(...)+ static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog, struct xdp_buff *xdp, int *status, - struct netdev_queue *tx_nq, struct tsnep_tx *tx) + struct netdev_queue *tx_nq, struct tsnep_tx *tx, + bool zc) { unsigned int length; - unsigned int sync; u32 act; length = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM; act = bpf_prog_run_xdp(prog, xdp); - - /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */ - sync = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM; - sync = max(sync, length); - switch (act) { case XDP_PASS: return false; @@ -1027,8 +1149,21 @@ static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog, trace_xdp_exception(rx->adapter->netdev, prog, act); fallthrough; case XDP_DROP: - page_pool_put_page(rx->page_pool, virt_to_head_page(xdp->data), - sync, true); + if (zc) { + xsk_buff_free(xdp); + } else { + unsigned int sync; + + /* Due xdp_adjust_tail: DMA sync for_device cover max + * len CPU touch + */ + sync = xdp->data_end - xdp->data_hard_start - + XDP_PACKET_HEADROOM; + sync = max(sync, length); + page_pool_put_page(rx->page_pool, + virt_to_head_page(xdp->data), sync, + true); + } return true; } } @@ -1181,7 +1316,8 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi, length, false); consume = tsnep_xdp_run_prog(rx, prog, &xdp, - &xdp_status, tx_nq, tx); + &xdp_status, tx_nq, tx, + false); if (consume) { rx->packets++; rx->bytes += length; @@ -1205,6 +1341,125 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi, return done; } +static int tsnep_rx_poll_zc(struct tsnep_rx *rx, struct napi_struct *napi, + int budget) +{ + struct tsnep_rx_entry *entry; + struct netdev_queue *tx_nq; + struct bpf_prog *prog; + struct tsnep_tx *tx; + int desc_available; + int xdp_status = 0; + struct page *page; + int done = 0; + int length; + + desc_available = tsnep_rx_desc_available(rx); + prog = READ_ONCE(rx->adapter->xdp_prog); + if (prog) { + tx_nq = netdev_get_tx_queue(rx->adapter->netdev, + rx->tx_queue_index); + tx = &rx->adapter->tx[rx->tx_queue_index]; + } + + while (likely(done < budget) && (rx->read != rx->write)) { + entry = &rx->entry[rx->read]; + if ((__le32_to_cpu(entry->desc_wb->properties) & + TSNEP_DESC_OWNER_COUNTER_MASK) != + (entry->properties & TSNEP_DESC_OWNER_COUNTER_MASK)) + break; + done++; + + if (desc_available >= TSNEP_RING_RX_REFILL) { + bool reuse = desc_available >= TSNEP_RING_RX_REUSE; + + desc_available -= tsnep_rx_refill_zc(rx, desc_available, + reuse); + if (!entry->xdp) { + /* buffer has been reused for refill to prevent + * empty RX ring, thus buffer cannot be used for + * RX processing + */ + rx->read = (rx->read + 1) % TSNEP_RING_SIZE; + desc_available++; + + rx->dropped++; + + continue; + } + } + + /* descriptor properties shall be read first, because valid data + * is signaled there + */ + dma_rmb(); + + prefetch(entry->xdp->data); + length = __le32_to_cpu(entry->desc_wb->properties) & + TSNEP_DESC_LENGTH_MASK; + entry->xdp->data_end = entry->xdp->data + length; + xsk_buff_dma_sync_for_cpu(entry->xdp, rx->xsk_pool); + + /* RX metadata with timestamps is in front of actual data, + * subtract metadata size to get length of actual data and + * consider metadata size as offset of actual data during RX + * processing + */ + length -= TSNEP_RX_INLINE_METADATA_SIZE; + + rx->read = (rx->read + 1) % TSNEP_RING_SIZE; + desc_available++; + + if (prog) { + bool consume; + + entry->xdp->data += TSNEP_RX_INLINE_METADATA_SIZE; + entry->xdp->data_meta += TSNEP_RX_INLINE_METADATA_SIZE; + + consume = tsnep_xdp_run_prog(rx, prog, entry->xdp, + &xdp_status, tx_nq, tx, + true);reason for separate xdp run prog routine for ZC was usually "likely-fying" XDP_REDIRECT action as this is the main action for AF_XDP which was giving us perf improvement. Please try this out on your side to see if this yields any positive value.One more thing - you have to handle XDP_TX action in a ZC specific way. Your current code will break if you enable xsk_pool and return XDP_TX from XDP prog.I took again a look to igc, but I didn't found any specifics for XDP_TX ZC. Only some buffer flipping, which I assume is needed for shared pages. For ice I see a call to xdp_convert_buff_to_frame() in ZC path, which has some XSK logic within. Is this the ZC specific way? igc calls xdp_convert_buff_to_frame() in both cases, so I'm not sure. But I will try the XDP_TX action. I did test only with xdpsock.I think I will back off a bit with a statement that your XDP_TX is clearly broken, here's why. igc when converting xdp_buff to xdp_frame and xdp_buff's memory being backed by xsk_buff_pool will grab the new page from kernel, copy the contents of xdp_buff to it, recycle xdp_buff back to xsk_buff_pool and return new page back to driver (i have just described what xdp_convert_zc_to_xdp_frame() is doing). Thing is that it is expensive and hurts perf and we stepped away from this on ice, this is a matter of storing the xdp_buff onto adequate Tx buffer struct that you can access while cleaning Tx descriptors so that you'll be able to xsk_buff_free() it. So saying 'your current code will break' might have been too much from my side. Just make sure that your XDP_TX action on ZC works. In order to do that, i was loading xdpsock with XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD flag on a queue receiving frames and then running xdp_rxq_info with XDP_TX action.
I took a look to the new ice code. It makes sense to prevent that copying. I took a note for future work. For now XDP_REDIRECT is the optimisation path. Tested as suggested. XDP_TX works.