On Fri, 9 Apr 2021 08:40:51 +0200 Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:

> On Fri, Apr 9, 2021 at 1:06 AM Neal Shukla <nshukla@xxxxxxxxxxxxx> wrote:
> >
> > Using perf, we've confirmed that the line mentioned has a 25.58% cache miss
> > rate.
>
> Do these hit in the LLC or in DRAM? In any case, your best bet is
> likely to prefetch this into your L1/L2. In my experience, the best
> way to do this is not to use an explicit prefetch instruction, but to
> touch/fetch the cache lines you need at the beginning of your
> computation and let the fetch latency and the usage of the first cache
> line hide the latencies of fetching the others. In your case, touch
> both the metadata and the packet at the same time. Work with the
> metadata and other things, then come back to the packet data;
> hopefully the relevant part will reside in the cache or registers by
> then. If that does not work, touch packet number N+1 just before
> starting on packet N.
>
> Very general recommendations, but I hope they help anyway. How exactly
> to do this efficiently is very application dependent.

I see you use the i40e driver, and that driver does a net_prefetch(xdp->data)
*AFTER* the XDP hook. That could explain what you are seeing.

Can you try the patch below and see if it solves your observed issue?

> > On Thu, Apr 8, 2021 at 2:38 PM Zvi Effron <zeffron@xxxxxxxxxxxxx> wrote:
> > >
> > > Apologies for the spam to anyone who received my first response, but
> > > it was accidentally sent as HTML and rejected by the mailing list.
> > >
> > > On Thu, Apr 8, 2021 at 11:20 AM Neal Shukla <nshukla@xxxxxxxxxxxxx> wrote:
> > > >
> > > > System Info:
> > > > CPU: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> > > > Network Adapter/NIC: Intel X710
> > > > Driver: i40e
> > > > Kernel version: 5.8.15
> > > > OS: Fedora 33
> > > >
> >
> > Slight correction, we're actually on the 5.10.10 kernel.
[PATCH] i40e: Move net_prefetch to benefit XDP

From: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>

DEBUG PATCH WITH XXX comments

Signed-off-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e398b8ac2a85..c09b8a5e6a2a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2121,7 +2121,7 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
 	struct sk_buff *skb;
 
 	/* prefetch first cache line of first page */
-	net_prefetch(xdp->data);
+	net_prefetch(xdp->data); // XXX: Too late for XDP
 
 	/* Note, we get here by enabling legacy-rx via:
 	 *
@@ -2205,7 +2205,7 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
 	 * likely have a consumer accessing first few bytes of meta
 	 * data, and then actual data.
 	 */
-	net_prefetch(xdp->data_meta);
+//	net_prefetch(xdp->data_meta); // XXX: too late for XDP
 
 	/* build an skb around the page buffer */
 	skb = build_skb(xdp->data_hard_start, truesize);
@@ -2513,6 +2513,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 			/* At larger PAGE_SIZE, frame_sz depend on len size */
 			xdp.frame_sz = i40e_rx_frame_truesize(rx_ring, size);
 #endif
+			net_prefetch(xdp.data);
 			skb = i40e_run_xdp(rx_ring, &xdp);
 		}
 
@@ -2530,6 +2531,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 		} else if (skb) {
 			i40e_add_rx_frag(rx_ring, rx_buffer, skb, size);
 		} else if (ring_uses_build_skb(rx_ring)) {
+			// XXX: net_prefetch called after i40e_run_xdp()
 			skb = i40e_build_skb(rx_ring, rx_buffer, &xdp);
 		} else {
 			skb = i40e_construct_skb(rx_ring, rx_buffer, &xdp);