On Wed, Apr 10, 2019 at 01:43:58PM +0200, Jesper Dangaard Brouer wrote:
> A lot of the performance gain comes from this patch.
>
> While analysing performance overhead it was found that the largest CPU
> stalls were caused when touching the struct page area. It is first read with
> a READ_ONCE from build_skb_around via page_is_pfmemalloc(), and when freed
> written by page_frag_free() call.
>
> Measurements show that the prefetchw (W) variant operation is needed to
> achieve the performance gain. We believe this optimization is twofold,
> first the W-variant saves one step in the cache-coherency protocol, and
> second it helps us to avoid the non-temporal prefetch HW optimizations and
> bring this into all cache-levels. It might be worth investigating if
> prefetch into L2 will have the same benefit.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
> ---
>  kernel/bpf/cpumap.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index b82a11556ad5..4758482ab5b9 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -281,6 +281,18 @@ static int cpu_map_kthread_run(void *data)
>  		 * consume side valid as no-resize allowed of queue.
>  		 */
>  		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
> +
> +		for (i = 0; i < n; i++) {
> +			void *f = frames[i];
> +			struct page *page = virt_to_page(f);
> +
> +			/* Bring struct page memory area to curr CPU. Read by
> +			 * build_skb_around via page_is_pfmemalloc(), and when
> +			 * freed written by page_frag_free call.
> +			 */
> +			prefetchw(page);
> +		}
> +
>  		m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
>  		if (unlikely(m == 0)) {
>  			for (i = 0; i < n; i++)
>

LGTM

Acked-by: Ilias Apalodimas <ilias.apalodimas@xxxxxxxxxx>
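
For anyone who wants to experiment with the pattern outside the kernel, here is a rough userspace sketch of the same idea: issue a write-intent prefetch for every item in the batch first, then do the read-then-write pass. It is not the cpumap code. `struct meta`, `BATCH` and `process_batch()` are hypothetical names made up for illustration, and `__builtin_prefetch(addr, 1, 3)` is the GCC/Clang builtin standing in for the kernel's prefetchw(). Whether the compiler actually emits a W-variant instruction depends on the target and flags, so treat it as a hint only.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH 8 /* hypothetical batch size, stands in for CPUMAP_BATCH */

/* Hypothetical stand-in for per-frame metadata such as struct page. */
struct meta {
	uint32_t flags;    /* read first, like page_is_pfmemalloc() */
	uint32_t refcount; /* written later, like page_frag_free() */
};

/* Pass 1 hints write intent for each item; pass 2 reads, then writes.
 * Returns how many items had bit 0 of flags set.
 */
static size_t process_batch(struct meta *items[], size_t n)
{
	size_t i, flagged = 0;

	for (i = 0; i < n; i++)
		__builtin_prefetch(items[i], 1, 3); /* rw=1 (write), locality=3 */

	for (i = 0; i < n; i++) {
		if (items[i]->flags & 1u)
			flagged++;
		items[i]->refcount--;
	}
	return flagged;
}
```

The prefetch loop is kept separate from the processing loop on purpose, mirroring the patch: the hints for the whole batch are issued before the first item is touched, which gives the memory system time to fetch the lines in a writable state.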