On Mon, Oct 12, 2020 at 10:37 AM Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote: > > On Fri, Oct 9, 2020 at 5:03 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > > > > On 10/8/20 4:12 PM, Magnus Karlsson wrote: > > > From: Magnus Karlsson <magnus.karlsson@xxxxxxxxx> > > > > > > Introduce one cache line worth of padding between the producer and > > > consumer pointers in all the lockless rings. This so that the HW > > > adjacency prefetcher will not prefetch the consumer pointer when the > > > producer pointer is used and vice versa. This improves throughput > > > performance for the l2fwd sample app with 2% on my machine with HW > > > prefetching turned on. > > > > > > Signed-off-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx> > > > > Applied, thanks! > > > > > net/xdp/xsk_queue.h | 4 ++++ > > > 1 file changed, 4 insertions(+) > > > > > > diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h > > > index dc1dd5e..3c235d2 100644 > > > --- a/net/xdp/xsk_queue.h > > > +++ b/net/xdp/xsk_queue.h > > > @@ -15,6 +15,10 @@ > > > > > > struct xdp_ring { > > > u32 producer ____cacheline_aligned_in_smp; > > > + /* Hinder the adjacent cache prefetcher to prefetch the consumer pointer if the producer > > > + * pointer is touched and vice versa. > > > + */ > > > + u32 pad ____cacheline_aligned_in_smp; > > > u32 consumer ____cacheline_aligned_in_smp; > > > u32 flags; > > > }; > > > > > > > I was wondering whether we should even generalize this further for reuse > > elsewhere e.g. ... > > > > diff --git a/include/linux/cache.h b/include/linux/cache.h > > index 1aa8009f6d06..5521dab01649 100644 > > --- a/include/linux/cache.h > > +++ b/include/linux/cache.h > > @@ -85,4 +85,17 @@ > > #define cache_line_size() L1_CACHE_BYTES > > #endif > > > > +/* > > + * Dummy element for use in structs in order to pad a cacheline > > + * aligned element with an extra cacheline to hinder the adjacent > > + * cache prefetcher to prefetch the subsequent struct element. > > + */ > > +#ifndef ____cacheline_padding_in_smp > > +# ifdef CONFIG_SMP > > +# define ____cacheline_padding_in_smp u8 :8 ____cacheline_aligned_in_smp > > +# else > > +# define ____cacheline_padding_in_smp > > +# endif /* CONFIG_SMP */ > > +#endif > > + > > #endif /* __LINUX_CACHE_H */ > > diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h > > index cdb9cf3cd136..1da36423e779 100644 > > --- a/net/xdp/xsk_queue.h > > +++ b/net/xdp/xsk_queue.h > > @@ -15,11 +15,9 @@ > > > > struct xdp_ring { > > u32 producer ____cacheline_aligned_in_smp; > > - /* Hinder the adjacent cache prefetcher to prefetch the consumer > > - * pointer if the producer pointer is touched and vice versa. > > - */ > > - u32 pad ____cacheline_aligned_in_smp; > > + ____cacheline_padding_in_smp; > > u32 consumer ____cacheline_aligned_in_smp; > > + ____cacheline_padding_in_smp; > > u32 flags; > > }; > > This should be beneficial in theory, though I could not measure any > statistically significant improvement. Though, the flags variable is > touched much less frequently than the producer and consumer pointers, > so that might explain it. We also need to make the compiler allocate > flags to a cache line 128 bytes (2 cache lines) from the consumer > pointer like this: > > u32 consumer ____cacheline_aligned_in_smp; > ____cacheline_padding_in_smp; > u32 flags ____cacheline_aligned_in_smp; > > > ... was there any improvement to also pad after the consumer given the struct > > xdp_ring is also embedded into other structs? > > Good idea. Yes, I do believe I see another ~0.4% increase and more > stable high numbers when trying this out. The xdp_ring is followed by > the ring descriptors themselves in both the rt/tx rings and the umem > rings. And these rings are quite large, 2K in the sample app, so > accessed less frequently (1/8th of the time with a batch size of 256 > and ring size 2K) which might explain the lower increase. In the end, > I ended up with the following struct: > > u32 producer ____cacheline_aligned_in_smp; > ____cacheline_padding_in_smp; > u32 consumer ____cacheline_aligned_in_smp; > ____cacheline_padding_in_smp; > u32 flags ____cacheline_aligned_in_smp; > ____cacheline_padding_in_smp; Actually, this might make more sense and save some memory: u32 producer ____cacheline_aligned_in_smp; ____cacheline_padding_in_smp; u32 consumer ____cacheline_aligned_in_smp; u32 flags; ____cacheline_padding_in_smp; So keep the flags colocated with the consumer on the same cache line. The reason I put it there to start with was that it is usually set in conjunction with the consumer pointer being updated. This might also explain why I did not see any performance improvement by putting it on its own 128-byte cache line. In summary, we make sure producer and consumer are separated with 128 bytes as well as consumer and the start of the descriptor ring. > Do you want to submit a patch, or shall I do it? I like your > ____cacheline_padding_in_smp better than my explicit "padN" member. > > Thanks: Magnus > > > Thanks, > > Daniel