On Wed, Apr 05, 2023 at 09:13:58PM +0200, Gerhard Engleder wrote:
> On 03.04.23 19:26, Maciej Fijalkowski wrote:
> > On Mon, Apr 03, 2023 at 07:19:15PM +0200, Maciej Fijalkowski wrote:
> > > On Sun, Apr 02, 2023 at 09:38:37PM +0200, Gerhard Engleder wrote:
> > > 
> > > Hey Gerhard,
> > > 
> > > > Add support for XSK zero-copy to RX path. The setup of the XSK pool can
> > > > be done at runtime. If the netdev is running, then the queue must be
> > > > disabled and enabled during reconfiguration. This can be done easily
> > > > with functions introduced in previous commits.
> > > > 
> > > > A more important property is that, if the netdev is running, then the
> > > > setup of the XSK pool shall not stop the netdev in case of errors. A
> > > > broken netdev after a failed XSK pool setup is bad behavior. Therefore,
> > > > the allocation and setup of resources during XSK pool setup is done only
> > > > before any queue is disabled. Additionally, freeing and later allocation
> > > > of resources is eliminated in some cases. Page pool entries are kept for
> > > > later use. Two memory models are registered in parallel. As a result,
> > > > the XSK pool setup cannot fail during queue reconfiguration.
> > > > 
> > > > In contrast to other drivers, XSK pool setup and XDP BPF program setup
> > > > are separate actions. XSK pool setup can be done without any XDP BPF
> > > > program. The XDP BPF program can be added, removed or changed without
> > > > any reconfiguration of the XSK pool.
> > > 
> > > I won't argue about your design, but I'd be glad if you would present any
> > > perf numbers (ZC vs copy mode) just to give us some overview how your
> > > implementation works out. Also, please consider using batching APIs and
> > > see if this gives you any boost (my assumption is that it would).
> > > 
> > > > 
> > > > Signed-off-by: Gerhard Engleder <gerhard@xxxxxxxxxxxxxxxxxxxxx>
> > > > ---
> > > >  drivers/net/ethernet/engleder/tsnep.h      |   7 +
> > > >  drivers/net/ethernet/engleder/tsnep_main.c | 432 ++++++++++++++++++++-
> > > >  drivers/net/ethernet/engleder/tsnep_xdp.c  |  67 ++++
> > > >  3 files changed, 488 insertions(+), 18 deletions(-)

(...)
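Side note on the "two memory models are registered in parallel" part of the
description - I read that as keeping a second xdp_rxq_info around for the XSK
case, roughly like the sketch below. This is only my guess at the shape of it,
the rx->xdp_rxq / rx->xdp_rxq_zc names and the helper are made up and error
unwinding is skipped:

/* sketch: both memory models registered up front so that attaching an XSK
 * pool later cannot fail; assumes both xdp_rxq_info structs have already
 * been registered with xdp_rxq_info_reg()
 */
static int tsnep_rx_reg_mem_models(struct tsnep_rx *rx)
{
        int ret;

        /* page_pool backed model for copy mode / plain XDP */
        ret = xdp_rxq_info_reg_mem_model(&rx->xdp_rxq, MEM_TYPE_PAGE_POOL,
                                         rx->page_pool);
        if (ret)
                return ret;

        /* XSK buff pool backed model, used once a pool is attached */
        return xdp_rxq_info_reg_mem_model(&rx->xdp_rxq_zc,
                                          MEM_TYPE_XSK_BUFF_POOL, NULL);
}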
> > > > +
> > > >  static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
> > > >                                 struct xdp_buff *xdp, int *status,
> > > > -                               struct netdev_queue *tx_nq, struct tsnep_tx *tx)
> > > > +                               struct netdev_queue *tx_nq, struct tsnep_tx *tx,
> > > > +                               bool zc)
> > > >  {
> > > >          unsigned int length;
> > > > -        unsigned int sync;
> > > >          u32 act;
> > > >          length = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
> > > >          act = bpf_prog_run_xdp(prog, xdp);
> > > > -
> > > > -        /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
> > > > -        sync = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
> > > > -        sync = max(sync, length);
> > > > -
> > > >          switch (act) {
> > > >          case XDP_PASS:
> > > >                  return false;
> > > > @@ -1027,8 +1149,21 @@ static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
> > > >                  trace_xdp_exception(rx->adapter->netdev, prog, act);
> > > >                  fallthrough;
> > > >          case XDP_DROP:
> > > > -                page_pool_put_page(rx->page_pool, virt_to_head_page(xdp->data),
> > > > -                                   sync, true);
> > > > +                if (zc) {
> > > > +                        xsk_buff_free(xdp);
> > > > +                } else {
> > > > +                        unsigned int sync;
> > > > +
> > > > +                        /* Due xdp_adjust_tail: DMA sync for_device cover max
> > > > +                         * len CPU touch
> > > > +                         */
> > > > +                        sync = xdp->data_end - xdp->data_hard_start -
> > > > +                               XDP_PACKET_HEADROOM;
> > > > +                        sync = max(sync, length);
> > > > +                        page_pool_put_page(rx->page_pool,
> > > > +                                           virt_to_head_page(xdp->data), sync,
> > > > +                                           true);
> > > > +                }
> > > >                  return true;
> > > >          }
> > > >  }
> > > > @@ -1181,7 +1316,8 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
> > > >                                           length, false);
> > > >                          consume = tsnep_xdp_run_prog(rx, prog, &xdp,
> > > > -                                                     &xdp_status, tx_nq, tx);
> > > > +                                                     &xdp_status, tx_nq, tx,
> > > > +                                                     false);
> > > >                          if (consume) {
> > > >                                  rx->packets++;
> > > >                                  rx->bytes += length;
> > > > @@ -1205,6 +1341,125 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
> > > >          return done;
> > > >  }
> > > > +static int tsnep_rx_poll_zc(struct tsnep_rx *rx, struct napi_struct *napi,
> > > > +                            int budget)
> > > > +{
> > > > +        struct tsnep_rx_entry *entry;
> > > > +        struct netdev_queue *tx_nq;
> > > > +        struct bpf_prog *prog;
> > > > +        struct tsnep_tx *tx;
> > > > +        int desc_available;
> > > > +        int xdp_status = 0;
> > > > +        struct page *page;
> > > > +        int done = 0;
> > > > +        int length;
> > > > +
> > > > +        desc_available = tsnep_rx_desc_available(rx);
> > > > +        prog = READ_ONCE(rx->adapter->xdp_prog);
> > > > +        if (prog) {
> > > > +                tx_nq = netdev_get_tx_queue(rx->adapter->netdev,
> > > > +                                            rx->tx_queue_index);
> > > > +                tx = &rx->adapter->tx[rx->tx_queue_index];
> > > > +        }
> > > > +
> > > > +        while (likely(done < budget) && (rx->read != rx->write)) {
> > > > +                entry = &rx->entry[rx->read];
> > > > +                if ((__le32_to_cpu(entry->desc_wb->properties) &
> > > > +                     TSNEP_DESC_OWNER_COUNTER_MASK) !=
> > > > +                    (entry->properties & TSNEP_DESC_OWNER_COUNTER_MASK))
> > > > +                        break;
> > > > +                done++;
> > > > +
> > > > +                if (desc_available >= TSNEP_RING_RX_REFILL) {
> > > > +                        bool reuse = desc_available >= TSNEP_RING_RX_REUSE;
> > > > +
> > > > +                        desc_available -= tsnep_rx_refill_zc(rx, desc_available,
> > > > +                                                             reuse);
> > > > +                        if (!entry->xdp) {
> > > > +                                /* buffer has been reused for refill to prevent
> > > > +                                 * empty RX ring, thus buffer cannot be used for
> > > > +                                 * RX processing
> > > > +                                 */
> > > > +                                rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
> > > > +                                desc_available++;
> > > > +
> > > > +                                rx->dropped++;
> > > > +
> > > > +                                continue;
> > > > +                        }
> > > > +                }
> > > > +
> > > > +                /* descriptor properties shall be read first, because valid data
> > > > +                 * is signaled there
> > > > +                 */
> > > > +                dma_rmb();
> > > > +
> > > > +                prefetch(entry->xdp->data);
> > > > +                length = __le32_to_cpu(entry->desc_wb->properties) &
> > > > +                         TSNEP_DESC_LENGTH_MASK;
> > > > +                entry->xdp->data_end = entry->xdp->data + length;
> > > > +                xsk_buff_dma_sync_for_cpu(entry->xdp, rx->xsk_pool);
> > > > +
> > > > +                /* RX metadata with timestamps is in front of actual data,
> > > > +                 * subtract metadata size to get length of actual data and
> > > > +                 * consider metadata size as offset of actual data during RX
> > > > +                 * processing
> > > > +                 */
> > > > +                length -= TSNEP_RX_INLINE_METADATA_SIZE;
> > > > +
> > > > +                rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
> > > > +                desc_available++;
> > > > +
> > > > +                if (prog) {
> > > > +                        bool consume;
> > > > +
> > > > +                        entry->xdp->data += TSNEP_RX_INLINE_METADATA_SIZE;
> > > > +                        entry->xdp->data_meta += TSNEP_RX_INLINE_METADATA_SIZE;
> > > > +
> > > > +                        consume = tsnep_xdp_run_prog(rx, prog, entry->xdp,
> > > > +                                                     &xdp_status, tx_nq, tx,
> > > > +                                                     true);
> > > 
> > > reason for separate xdp run prog routine for ZC was usually "likely-fying"
> > > XDP_REDIRECT action as this is the main action for AF_XDP which was giving
> > > us perf improvement. Please try this out on your side to see if this
> > > yields any positive value.
> > 
> > One more thing - you have to handle XDP_TX action in a ZC specific way.
> > Your current code will break if you enable xsk_pool and return XDP_TX from
> > XDP prog.
> 
> I took again a look to igc, but I didn't found any specifics for XDP_TX
> ZC. Only some buffer flipping, which I assume is needed for shared
> pages.
> For ice I see a call to xdp_convert_buff_to_frame() in ZC path, which
> has some XSK logic within. Is this the ZC specific way? igc calls
> xdp_convert_buff_to_frame() in both cases, so I'm not sure. But I will
> try the XDP_TX action. I did test only with xdpsock.

I think I will back off a bit from the statement that your XDP_TX is clearly
broken, and here is why. When igc converts an xdp_buff whose memory is backed
by an xsk_buff_pool to an xdp_frame, it grabs a new page from the kernel,
copies the contents of the xdp_buff into it, recycles the xdp_buff back to
the xsk_buff_pool and hands the new page back to the driver (I have just
described what xdp_convert_zc_to_xdp_frame() does). The thing is that this is
expensive and hurts performance, so we stepped away from it on ice. There it
is a matter of storing the xdp_buff onto an adequate Tx buffer struct that
you can access while cleaning Tx descriptors, so that you are able to
xsk_buff_free() it.

So saying 'your current code will break' might have been too much from my
side. Just make sure that your XDP_TX action on ZC works. To verify that, I
was loading xdpsock with the XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD flag on the
queue receiving frames and then running xdp_rxq_info with the XDP_TX action.
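For reference, passing that flag boils down to the libbpf_flags member of
struct xsk_socket_config when the AF_XDP socket is created - a stripped down
fragment (not a complete program, umem/ring setup is assumed to be done
elsewhere):

        struct xsk_socket_config cfg = {
                .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
                .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
                /* don't let libbpf attach its default redirect program, so
                 * that xdp_rxq_info can own the XDP hook on that queue
                 */
                .libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
        };

        err = xsk_socket__create(&xsk, ifname, queue_id, umem, &rx_ring,
                                 &tx_ring, &cfg);

With the pool bound to the queue and the default program inhibited, running
the xdp_rxq_info sample with its XDP_TX action (option names from memory,
something like --dev <ifname> --action XDP_TX) then exercises the ZC XDP_TX
path.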
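And to make the "store the xdp_buff onto an adequate Tx buffer struct" part a
bit more concrete, here is a rough, untested sketch of the shape it takes in
your driver. The tsnep_tx_entry members 'type' and 'xdp' plus the
TSNEP_TX_TYPE_XSK_TX value are invented here, and descriptor writing / ring
accounting are left out:

/* XDP_TX in the ZC path: no conversion to xdp_frame, just park the xdp_buff
 * at the Tx entry and free it back to the pool when cleaning Tx descriptors
 */
static bool tsnep_xdp_xmit_zc(struct tsnep_tx *tx, struct xdp_buff *xdp,
                              struct xsk_buff_pool *pool)
{
        u32 length = xdp->data_end - xdp->data;
        struct tsnep_tx_entry *entry = &tx->entry[tx->write];
        dma_addr_t dma;

        /* buffer is already mapped by the pool, only a sync is needed */
        dma = xsk_buff_xdp_get_dma(xdp);
        xsk_buff_raw_dma_sync_for_device(pool, dma, length);

        entry->type = TSNEP_TX_TYPE_XSK_TX;     /* invented Tx buffer type */
        entry->xdp = xdp;                       /* invented member */

        /* ... fill the Tx descriptor with dma/length and advance tx->write,
         * similar to the existing XDP frame Tx path ...
         */

        return true;
}

The Tx clean routine then frees such an entry with xsk_buff_free(entry->xdp)
instead of going through xdp_return_frame()/page_pool, which is what keeps
the buffer inside the xsk_buff_pool.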
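One more note on the "likely-fying" of XDP_REDIRECT from my earlier mail,
since it is cheap to try: the point is only to test for XDP_REDIRECT before
entering the switch in the ZC run-prog routine. Untested sketch below; the
drop/exception handling and the TSNEP_XDP_TX / TSNEP_XDP_REDIRECT status
flags are reused from your driver, tsnep_xdp_xmit_zc() is the hypothetical
helper from the sketch above, and Tx queue locking via tx_nq is omitted:

static bool tsnep_xdp_run_prog_zc(struct tsnep_rx *rx, struct bpf_prog *prog,
                                  struct xdp_buff *xdp, int *status,
                                  struct netdev_queue *tx_nq,
                                  struct tsnep_tx *tx)
{
        u32 act = bpf_prog_run_xdp(prog, xdp);

        /* XDP_REDIRECT is the main action for AF_XDP ZC, test it first */
        if (likely(act == XDP_REDIRECT)) {
                if (xdp_do_redirect(rx->adapter->netdev, xdp, prog) < 0)
                        goto out_failure;
                *status |= TSNEP_XDP_REDIRECT;
                return true;
        }

        switch (act) {
        case XDP_PASS:
                return false;
        case XDP_TX:
                if (!tsnep_xdp_xmit_zc(tx, xdp, rx->xsk_pool))
                        goto out_failure;
                *status |= TSNEP_XDP_TX;
                return true;
        default:
                bpf_warn_invalid_xdp_action(rx->adapter->netdev, prog, act);
                fallthrough;
        case XDP_ABORTED:
out_failure:
                trace_xdp_exception(rx->adapter->netdev, prog, act);
                fallthrough;
        case XDP_DROP:
                xsk_buff_free(xdp);
                return true;
        }
}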