On Mon, Apr 03, 2023 at 07:19:15PM +0200, Maciej Fijalkowski wrote:
> On Sun, Apr 02, 2023 at 09:38:37PM +0200, Gerhard Engleder wrote:
> 
> Hey Gerhard,
> 
> > Add support for XSK zero-copy to RX path. The setup of the XSK pool can
> > be done at runtime. If the netdev is running, then the queue must be
> > disabled and enabled during reconfiguration. This can be done easily
> > with functions introduced in previous commits.
> > 
> > A more important property is that, if the netdev is running, then the
> > setup of the XSK pool shall not stop the netdev in case of errors. A
> > broken netdev after a failed XSK pool setup is bad behavior. Therefore,
> > the allocation and setup of resources during XSK pool setup is done only
> > before any queue is disabled. Additionally, freeing and later allocation
> > of resources is eliminated in some cases. Page pool entries are kept for
> > later use. Two memory models are registered in parallel. As a result,
> > the XSK pool setup cannot fail during queue reconfiguration.
> > 
> > In contrast to other drivers, XSK pool setup and XDP BPF program setup
> > are separate actions. XSK pool setup can be done without any XDP BPF
> > program. The XDP BPF program can be added, removed or changed without
> > any reconfiguration of the XSK pool.
> 
> I won't argue about your design, but I'd be glad if you would present any
> perf numbers (ZC vs copy mode) just to give us some overview how your
> implementation works out. Also, please consider using batching APIs and
> see if this gives you any boost (my assumption is that it would).
> 
> > 
> > Signed-off-by: Gerhard Engleder <gerhard@xxxxxxxxxxxxxxxxxxxxx>
> > ---
> >  drivers/net/ethernet/engleder/tsnep.h      |   7 +
> >  drivers/net/ethernet/engleder/tsnep_main.c | 432 ++++++++++++++++++++-
> >  drivers/net/ethernet/engleder/tsnep_xdp.c  |  67 ++++
> >  3 files changed, 488 insertions(+), 18 deletions(-)

(...)
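On the batching remark quoted above: below is a minimal, untested sketch of
what a batched variant of tsnep_rx_refill_zc() could look like with
xsk_buff_alloc_batch(). TSNEP_RX_ZC_BATCH and tsnep_rx_set_entry_addr() are
made-up placeholders; the other names are the ones visible in this patch.

/* illustrative only - assumes the ring/entry layout from this patch */
#include <net/xdp_sock_drv.h>

#define TSNEP_RX_ZC_BATCH 16	/* made-up batch size */

static int tsnep_rx_refill_zc_batch(struct tsnep_rx *rx, int count)
{
	struct xdp_buff *batch[TSNEP_RX_ZC_BATCH];
	int filled = 0;

	while (count) {
		u32 n = min_t(u32, count, TSNEP_RX_ZC_BATCH);
		u32 allocated, i;

		/* one pool call per batch instead of one xsk_buff_alloc()
		 * per descriptor
		 */
		allocated = xsk_buff_alloc_batch(rx->xsk_pool, batch, n);
		if (!allocated)
			break;

		for (i = 0; i < allocated; i++) {
			struct tsnep_rx_entry *entry = &rx->entry[rx->write];

			entry->xdp = batch[i];
			/* placeholder: write the buffer's DMA address into
			 * the hardware descriptor
			 */
			tsnep_rx_set_entry_addr(entry,
						xsk_buff_xdp_get_dma(batch[i]));

			rx->write = (rx->write + 1) % TSNEP_RING_SIZE;
		}

		filled += allocated;
		count -= allocated;
	}

	return filled;
}

Whether this is actually a win depends on how expensive the per-descriptor
pool calls are on your hardware, so it is only worth keeping if it shows up
in the perf numbers.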
> > +
> >  static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
> >                                 struct xdp_buff *xdp, int *status,
> > -                               struct netdev_queue *tx_nq, struct tsnep_tx *tx)
> > +                               struct netdev_queue *tx_nq, struct tsnep_tx *tx,
> > +                               bool zc)
> >  {
> >          unsigned int length;
> > -        unsigned int sync;
> >          u32 act;
> > 
> >          length = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
> > 
> >          act = bpf_prog_run_xdp(prog, xdp);
> > -
> > -        /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
> > -        sync = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
> > -        sync = max(sync, length);
> > -
> >          switch (act) {
> >          case XDP_PASS:
> >                  return false;
> > @@ -1027,8 +1149,21 @@ static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
> >                  trace_xdp_exception(rx->adapter->netdev, prog, act);
> >                  fallthrough;
> >          case XDP_DROP:
> > -                page_pool_put_page(rx->page_pool, virt_to_head_page(xdp->data),
> > -                                   sync, true);
> > +                if (zc) {
> > +                        xsk_buff_free(xdp);
> > +                } else {
> > +                        unsigned int sync;
> > +
> > +                        /* Due xdp_adjust_tail: DMA sync for_device cover max
> > +                         * len CPU touch
> > +                         */
> > +                        sync = xdp->data_end - xdp->data_hard_start -
> > +                               XDP_PACKET_HEADROOM;
> > +                        sync = max(sync, length);
> > +                        page_pool_put_page(rx->page_pool,
> > +                                           virt_to_head_page(xdp->data), sync,
> > +                                           true);
> > +                }
> >                  return true;
> >          }
> >  }
> > @@ -1181,7 +1316,8 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
> >                                           length, false);
> > 
> >                          consume = tsnep_xdp_run_prog(rx, prog, &xdp,
> > -                                                     &xdp_status, tx_nq, tx);
> > +                                                     &xdp_status, tx_nq, tx,
> > +                                                     false);
> >                          if (consume) {
> >                                  rx->packets++;
> >                                  rx->bytes += length;
> > @@ -1205,6 +1341,125 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
> >          return done;
> >  }
> > 
> > +static int tsnep_rx_poll_zc(struct tsnep_rx *rx, struct napi_struct *napi,
> > +                            int budget)
> > +{
> > +        struct tsnep_rx_entry *entry;
> > +        struct netdev_queue *tx_nq;
> > +        struct bpf_prog *prog;
> > +        struct tsnep_tx *tx;
> > +        int desc_available;
> > +        int xdp_status = 0;
> > +        struct page *page;
> > +        int done = 0;
> > +        int length;
> > +
> > +        desc_available = tsnep_rx_desc_available(rx);
> > +        prog = READ_ONCE(rx->adapter->xdp_prog);
> > +        if (prog) {
> > +                tx_nq = netdev_get_tx_queue(rx->adapter->netdev,
> > +                                            rx->tx_queue_index);
> > +                tx = &rx->adapter->tx[rx->tx_queue_index];
> > +        }
> > +
> > +        while (likely(done < budget) && (rx->read != rx->write)) {
> > +                entry = &rx->entry[rx->read];
> > +                if ((__le32_to_cpu(entry->desc_wb->properties) &
> > +                     TSNEP_DESC_OWNER_COUNTER_MASK) !=
> > +                    (entry->properties & TSNEP_DESC_OWNER_COUNTER_MASK))
> > +                        break;
> > +                done++;
> > +
> > +                if (desc_available >= TSNEP_RING_RX_REFILL) {
> > +                        bool reuse = desc_available >= TSNEP_RING_RX_REUSE;
> > +
> > +                        desc_available -= tsnep_rx_refill_zc(rx, desc_available,
> > +                                                             reuse);
> > +                        if (!entry->xdp) {
> > +                                /* buffer has been reused for refill to prevent
> > +                                 * empty RX ring, thus buffer cannot be used for
> > +                                 * RX processing
> > +                                 */
> > +                                rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
> > +                                desc_available++;
> > +
> > +                                rx->dropped++;
> > +
> > +                                continue;
> > +                        }
> > +                }
> > +
> > +                /* descriptor properties shall be read first, because valid data
> > +                 * is signaled there
> > +                 */
> > +                dma_rmb();
> > +
> > +                prefetch(entry->xdp->data);
> > +                length = __le32_to_cpu(entry->desc_wb->properties) &
> > +                         TSNEP_DESC_LENGTH_MASK;
> > +                entry->xdp->data_end = entry->xdp->data + length;
> > +                xsk_buff_dma_sync_for_cpu(entry->xdp, rx->xsk_pool);
> > +
> > +                /* RX metadata with timestamps is in front of actual data,
> > +                 * subtract metadata size to get length of actual data and
> > +                 * consider metadata size as offset of actual data during RX
> > +                 * processing
> > +                 */
> > +                length -= TSNEP_RX_INLINE_METADATA_SIZE;
> > +
> > +                rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
> > +                desc_available++;
> > +
> > +                if (prog) {
> > +                        bool consume;
> > +
> > +                        entry->xdp->data += TSNEP_RX_INLINE_METADATA_SIZE;
> > +                        entry->xdp->data_meta += TSNEP_RX_INLINE_METADATA_SIZE;
> > +
> > +                        consume = tsnep_xdp_run_prog(rx, prog, entry->xdp,
> > +                                                     &xdp_status, tx_nq, tx,
> > +                                                     true);
> 
> reason for separate xdp run prog routine for ZC was usually "likely-fying"
> XDP_REDIRECT action as this is the main action for AF_XDP which was giving
> us perf improvement. Please try this out on your side to see if this
> yields any positive value.

One more thing - you have to handle XDP_TX action in a ZC specific way.
Your current code will break if you enable xsk_pool and return XDP_TX from
XDP prog.
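FWIW, the ZC variants in other drivers usually end up looking roughly like
the untested sketch below, which covers both points: XDP_REDIRECT is checked
first under likely(), and XDP_TX clones the XSK buffer into a regular
xdp_frame so the TX ring never holds on to pool memory.
tsnep_xdp_xmit_frame_ring() is a made-up placeholder for your existing
XDP_TX helper, and TSNEP_XDP_TX/TSNEP_XDP_REDIRECT are assumed to be the
status flags from your copy-mode path.

#include <net/xdp_sock_drv.h>

static bool tsnep_xdp_run_prog_zc(struct tsnep_rx *rx, struct bpf_prog *prog,
				  struct xdp_buff *xdp, int *status,
				  struct netdev_queue *tx_nq,
				  struct tsnep_tx *tx)
{
	u32 act;

	act = bpf_prog_run_xdp(prog, xdp);

	/* XDP_REDIRECT is the most likely action for AF_XDP zero-copy */
	if (likely(act == XDP_REDIRECT)) {
		if (xdp_do_redirect(rx->adapter->netdev, xdp, prog) < 0)
			goto out_failure;
		*status |= TSNEP_XDP_REDIRECT;
		return true;
	}

	switch (act) {
	case XDP_PASS:
		return false;
	case XDP_TX: {
		struct xdp_frame *xdpf;

		/* For ZC buffers, xdp_convert_buff_to_frame() clones the data
		 * into a separately allocated frame and gives the pool buffer
		 * back, so the TX path only ever sees the clone.
		 */
		xdpf = xdp_convert_buff_to_frame(xdp);
		if (unlikely(!xdpf))
			goto out_failure;
		/* placeholder for the driver's XDP_TX helper that queues an
		 * xdp_frame on the TX ring
		 */
		if (!tsnep_xdp_xmit_frame_ring(xdpf, tx, tx_nq)) {
			xdp_return_frame(xdpf);
			trace_xdp_exception(rx->adapter->netdev, prog, act);
			return true;
		}
		*status |= TSNEP_XDP_TX;
		return true;
	}
	default:
		bpf_warn_invalid_xdp_action(rx->adapter->netdev, prog, act);
		fallthrough;
	case XDP_ABORTED:
out_failure:
		trace_xdp_exception(rx->adapter->netdev, prog, act);
		fallthrough;
	case XDP_DROP:
		xsk_buff_free(xdp);
		return true;
	}
}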