Re: [PATCH bpf-next 1/2] xsk: update rings for load-acquire/store-release semantics

Björn Töpel <bjorn.topel@xxxxxxxxx> · Tue, 2 Mar 2021 09:04:16 +0100

On 2021-03-01 17:08, Toke Høiland-Jørgensen wrote:
Björn Töpel <bjorn.topel@xxxxxxxxx> writes:

From: Björn Töpel <bjorn.topel@xxxxxxxxx>

Currently, the AF_XDP rings uses smp_{r,w,}mb() fences on the
kernel-side. By updating the rings for load-acquire/store-release
semantics, the full barrier on the consumer side can be replaced with
improved performance as a nice side-effect.

Note that this change does *not* require similar changes on the
libbpf/userland side, however it is recommended [1].

On x86-64 systems, by removing the smp_mb() on the Rx and Tx side, the
l2fwd AF_XDP xdpsock sample performance increases by
1%. Weakly-ordered platforms, such as ARM64 might benefit even more.

[1] https://lore.kernel.org/bpf/20200316184423.GA14143@willie-the-truck/

Signed-off-by: Björn Töpel <bjorn.topel@xxxxxxxxx>
---
  net/xdp/xsk_queue.h | 27 +++++++++++----------------
  1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 2823b7c3302d..e24279d8d845 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -47,19 +47,18 @@ struct xsk_queue {
  	u64 queue_empty_descs;
  };
  
-/* The structure of the shared state of the rings are the same as the
- * ring buffer in kernel/events/ring_buffer.c. For the Rx and completion
- * ring, the kernel is the producer and user space is the consumer. For
- * the Tx and fill rings, the kernel is the consumer and user space is
- * the producer.
+/* The structure of the shared state of the rings are a simple
+ * circular buffer, as outlined in
+ * Documentation/core-api/circular-buffers.rst. For the Rx and
+ * completion ring, the kernel is the producer and user space is the
+ * consumer. For the Tx and fill rings, the kernel is the consumer and
+ * user space is the producer.
   *
   * producer                         consumer
   *
- * if (LOAD ->consumer) {           LOAD ->producer
- *                    (A)           smp_rmb()       (C)
+ * if (LOAD ->consumer) {  (A)      LOAD.acq ->producer  (C)

Why is LOAD.acq not needed on the consumer side?


You mean why LOAD.acq is not needed on the *producer* side, i.e. the
->consumer? The ->consumer is a control dependency for the store, so
there is no ordering constraint for ->consumer at producer side. If
there's no space, no data is written. So, no barrier is needed there --
at least that has been my perspective.

This is very similar to the buffer in
Documentation/core-api/circular-buffers.rst. Roping in Paul for some
guidance.


Björn

-Toke