Re: [PATCH bpf-next] xsk: introduce padding between ring pointers

Magnus Karlsson <magnus.karlsson@xxxxxxxxx> · Mon, 12 Oct 2020 11:37:28 +0200

On Mon, Oct 12, 2020 at 10:37 AM Magnus Karlsson
<magnus.karlsson@xxxxxxxxx> wrote:
>
> On Fri, Oct 9, 2020 at 5:03 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >
> > On 10/8/20 4:12 PM, Magnus Karlsson wrote:
> > > From: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
> > >
> > > Introduce one cache line worth of padding between the producer and
> > > consumer pointers in all the lockless rings. This so that the HW
> > > adjacency prefetcher will not prefetch the consumer pointer when the
> > > producer pointer is used and vice versa. This improves throughput
> > > performance for the l2fwd sample app with 2% on my machine with HW
> > > prefetching turned on.
> > >
> > > Signed-off-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
> >
> > Applied, thanks!
> >
> > >   net/xdp/xsk_queue.h | 4 ++++
> > >   1 file changed, 4 insertions(+)
> > >
> > > diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> > > index dc1dd5e..3c235d2 100644
> > > --- a/net/xdp/xsk_queue.h
> > > +++ b/net/xdp/xsk_queue.h
> > > @@ -15,6 +15,10 @@
> > >
> > >   struct xdp_ring {
> > >       u32 producer ____cacheline_aligned_in_smp;
> > > +     /* Hinder the adjacent cache prefetcher to prefetch the consumer pointer if the producer
> > > +      * pointer is touched and vice versa.
> > > +      */
> > > +     u32 pad ____cacheline_aligned_in_smp;
> > >       u32 consumer ____cacheline_aligned_in_smp;
> > >       u32 flags;
> > >   };
> > >
> >
> > I was wondering whether we should even generalize this further for reuse
> > elsewhere e.g. ...
> >
> > diff --git a/include/linux/cache.h b/include/linux/cache.h
> > index 1aa8009f6d06..5521dab01649 100644
> > --- a/include/linux/cache.h
> > +++ b/include/linux/cache.h
> > @@ -85,4 +85,17 @@
> >   #define cache_line_size()      L1_CACHE_BYTES
> >   #endif
> >
> > +/*
> > + * Dummy element for use in structs in order to pad a cacheline
> > + * aligned element with an extra cacheline to hinder the adjacent
> > + * cache prefetcher to prefetch the subsequent struct element.
> > + */
> > +#ifndef ____cacheline_padding_in_smp
> > +# ifdef CONFIG_SMP
> > +#  define ____cacheline_padding_in_smp u8 :8 ____cacheline_aligned_in_smp
> > +# else
> > +#  define ____cacheline_padding_in_smp
> > +# endif /* CONFIG_SMP */
> > +#endif
> > +
> >   #endif /* __LINUX_CACHE_H */
> > diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> > index cdb9cf3cd136..1da36423e779 100644
> > --- a/net/xdp/xsk_queue.h
> > +++ b/net/xdp/xsk_queue.h
> > @@ -15,11 +15,9 @@
> >
> >   struct xdp_ring {
> >          u32 producer ____cacheline_aligned_in_smp;
> > -       /* Hinder the adjacent cache prefetcher to prefetch the consumer
> > -        * pointer if the producer pointer is touched and vice versa.
> > -        */
> > -       u32 pad ____cacheline_aligned_in_smp;
> > +       ____cacheline_padding_in_smp;
> >          u32 consumer ____cacheline_aligned_in_smp;
> > +       ____cacheline_padding_in_smp;
> >          u32 flags;
> >   };
>
> This should be beneficial in theory, though I could not measure any
> statistically significant improvement. Though, the flags variable is
> touched much less frequently than the producer and consumer pointers,
> so that might explain it. We also need to make the compiler allocate
> flags to a cache line 128 bytes (2 cache lines) from the consumer
> pointer like this:
>
> u32 consumer ____cacheline_aligned_in_smp;
> ____cacheline_padding_in_smp;
> u32 flags ____cacheline_aligned_in_smp;
>
> > ... was there any improvement to also pad after the consumer given the struct
> > xdp_ring is also embedded into other structs?
>
> Good idea. Yes, I do believe I see another ~0.4% increase and more
> stable high numbers when trying this out. The xdp_ring is followed by
> the ring descriptors themselves in both the rt/tx rings and the umem
> rings. And these rings are quite large, 2K in the sample app, so
> accessed less frequently (1/8th of the time with a batch size of 256
> and ring size 2K) which might explain the lower increase. In the end,
> I ended up with the following struct:
>
> u32 producer ____cacheline_aligned_in_smp;
> ____cacheline_padding_in_smp;
> u32 consumer ____cacheline_aligned_in_smp;
> ____cacheline_padding_in_smp;
> u32 flags ____cacheline_aligned_in_smp;
> ____cacheline_padding_in_smp;

Actually, this might make more sense and save some memory:

u32 producer ____cacheline_aligned_in_smp;
____cacheline_padding_in_smp;
u32 consumer ____cacheline_aligned_in_smp;
u32 flags;
____cacheline_padding_in_smp;

So keep the flags colocated with the consumer on the same cache line.
The reason I put it there to start with was that it is usually set in
conjunction with the consumer pointer being updated. This might also
explain why I did not see any performance improvement by putting it on
its own 128-byte cache line. In summary, we make sure producer and
consumer are separated with 128 bytes as well as consumer and the
start of the descriptor ring.

> Do you want to submit a patch, or shall I do it? I like your
> ____cacheline_padding_in_smp better than my explicit "padN" member.
>
> Thanks: Magnus
>
> > Thanks,
> > Daniel