Re: [RFC PATCH bpf-next 0/8] XDP_REDIRECT_XSK and Batched AF_XDP Rx

On 16/11/2021 08.37, Ciara Loftus wrote:
The common case for AF_XDP sockets (xsks) is creating a single xsk on a queue for sending and
receiving frames, as this is analogous to HW packet steering through RSS and other classification
methods in the NIC. AF_XDP uses the XDP redirect infrastructure to direct packets to the socket.
That infrastructure was designed for the much more complicated case of DEVMAP redirects, which
direct traffic to another netdev and thus potentially another driver. In the xsk redirect case,
skipping the unnecessary parts of this common code significantly improves performance and paves the
way for batching in the driver. This RFC proposes one such way to simplify the infrastructure,
which yields a 27% increase in throughput and a reduction of 24 cycles per packet [1]. The goal of
this RFC is to start a discussion on how best to simplify the single-socket datapath, with one
method provided as an example.

Current approach:
1. XSK pointer: an xsk is created and a handle to the xsk is stored in the XSKMAP.
2. XDP program: the bpf_redirect_map helper triggers the XSKMAP lookup, which stores the result
(handle to the xsk) and the map type (XSKMAP) in the per-CPU bpf_redirect_info struct. The
XDP_REDIRECT action is returned.
3. XDP_REDIRECT handling called by the driver: the map type (XSKMAP) is read from the
bpf_redirect_info which selects the xsk_map_redirect path. The xsk pointer is retrieved from the
bpf_redirect_info and the XDP descriptor is pushed to the xsk's Rx ring. The socket is added to a
list for flushing later.
4. xdp_do_flush: iterate through the lists of all maps that can be used for redirect (CPUMAP,
DEVMAP and XSKMAP). When XSKMAP is flushed, go through all xsks that had any traffic redirected to
them and bump the Rx ring head pointer(s).

For the end goal of submitting the descriptor to the Rx ring and bumping the head pointer of that
ring, only some of these steps are needed. The rest is overhead. The bpf_redirect_map
infrastructure is needed for all other redirect operations, but is not necessary when redirecting
to a single AF_XDP socket. Similarly, flushing the list for every map type in step 4 is not
necessary when only one socket needs to be flushed.
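
For reference, a minimal XDP program for this current approach looks roughly like the following
(the map name and sizing are illustrative, not taken from the series):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
	/* Step 2: record the target xsk and the map type (XSKMAP) in the
	 * per-CPU bpf_redirect_info and return XDP_REDIRECT; the driver
	 * then resolves that in xdp_do_redirect() (step 3).
	 */
	return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";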

Proposed approach:
1. XSK pointer: an xsk is created and a handle to the xsk is stored both in the XSKMAP and in the
netdev_rx_queue struct.
2. XDP program: new bpf_redirect_xsk helper returns XDP_REDIRECT_XSK.
3. XDP_REDIRECT_XSK handling called by the driver: the xsk pointer is retrieved from the
netdev_rx_queue struct and the XDP descriptor is pushed to the xsk's Rx ring.
4. xsk_flush: fetch the handle from the netdev_rx_queue and flush the xsk.

This fast path is triggered on XDP_REDIRECT_XSK if:
   (i) AF_XDP socket SW Rx ring configured
  (ii) Exactly one xsk attached to the queue
If either condition is not met, the behavior falls back to that of the original approach
(xdp_redirect_map). This is handled under the hood in the new bpf_xdp_redirect_xsk helper, so the
user does not need to be aware of these conditions.
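
With the new helper, the default program collapses to something like the sketch below (same
xsks_map definition as in the previous sketch; the helper signature with an XSKMAP plus a fallback
action is assumed here for illustration, not copied from the patches):

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
	/* bpf_redirect_xsk() is the helper proposed by this series (its
	 * declaration would come from the updated UAPI headers). It returns
	 * XDP_REDIRECT_XSK when exactly one xsk with an Rx ring is bound to
	 * this queue; otherwise it falls back internally to the regular
	 * bpf_redirect_map()/XDP_REDIRECT behavior.
	 */
	return bpf_redirect_xsk(&xsks_map, XDP_PASS);
}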

Batching:
With this new approach it is possible to optimize the driver by submitting a batch of descriptors
to the Rx ring in step 3, simply by verifying that the action returned by the program for every
packet in the batch equals XDP_REDIRECT_XSK. With that action we know the target socket is the same
for each packet in the batch. This is not possible with XDP_REDIRECT because the xsk pointer is
stored in bpf_redirect_info and is not guaranteed to be the same for every packet in a batch.
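
A rough driver-side sketch of that batching logic; run_xdp(), xsk_rcv_batch() and
handle_xdp_action() are placeholder names, not the actual i40e functions from this series:

#define RX_BATCH_SIZE 64

/* Placeholders for the real driver/XSK plumbing: */
int run_xdp(struct xdp_buff *xdp);
void xsk_rcv_batch(struct xdp_buff **bufs, int n);
void handle_xdp_action(struct xdp_buff *xdp, int action);

static void rx_poll_batch(struct xdp_buff **bufs, int n)	/* n <= RX_BATCH_SIZE */
{
	int actions[RX_BATCH_SIZE];
	bool all_xsk = true;
	int i;

	for (i = 0; i < n; i++) {
		actions[i] = run_xdp(bufs[i]);	/* run the XDP prog once per packet */
		if (actions[i] != XDP_REDIRECT_XSK)
			all_xsk = false;
	}

	if (all_xsk) {
		/* Every verdict is XDP_REDIRECT_XSK, so every packet targets the
		 * single xsk bound to this queue: push the whole batch of
		 * descriptors to its Rx ring and bump the head pointer once.
		 */
		xsk_rcv_batch(bufs, n);
		return;
	}

	/* Mixed verdicts: fall back to per-packet action handling. */
	for (i = 0; i < n; i++)
		handle_xdp_action(bufs[i], actions[i]);
}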

[1] Performance:
The benchmarks were performed in a VM running on a 2.4GHz Ice Lake host with an i40e device passed
through. The xdpsock app was run on a single core with busy polling and configured in 'rxonly' mode:
./xdpsock -i <iface> -r -B
The improvement in throughput when using the new bpf helper and XDP action was measured at ~13% for
scalar processing, with a reduction of ~13 cycles per packet. A further ~14% improvement in
throughput and a reduction of ~11 cycles per packet were measured when the batched i40e driver path
was used, for a total improvement of ~27% in throughput and a reduction of ~24 cycles per packet.

Other approaches considered:
Two other approaches were considered, the advantage of both being that neither introduces a new XDP
action. The first alternative was to create a new map type, BPF_MAP_TYPE_XSKMAP_DIRECT. When the
XDP_REDIRECT action was returned, this map type could be checked and used as an indicator to skip
the map lookup and use the netdev_rx_queue xsk instead. The second approach was similar and
involved using a new bpf_redirect_info flag in the same fashion.
While both approaches yielded a performance improvement, it was only about half of that measured
for the approach outlined in this RFC. It seems using bpf_redirect_info is too expensive.

I think it was Bjørn who discovered that accessing the per-CPU bpf_redirect_info struct has an overhead of approx 2 ns (times 2.4GHz, ~4.8 cycles). Your reduction in cycles per packet was ~13, of which those ~4.8 seem to be a large share.

The code accesses this_cpu_ptr(&bpf_redirect_info) two times:
once in the BPF-helper redirect call and again in xdp_do_redirect.
(Hint: xdp_redirect_map ends up calling __bpf_xdp_redirect_map.)
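
Condensed sketch of those two access points (simplified, not verbatim kernel code):

/* (1) BPF-helper side: bpf_redirect_map() -> __bpf_xdp_redirect_map() */
static int sketch_redirect_map(struct bpf_map *map, u32 index)
{
	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

	ri->tgt_value = __xsk_map_lookup_elem(map, index);	/* xsk handle */
	ri->tgt_index = index;
	ri->map_type = map->map_type;				/* XSKMAP */
	return XDP_REDIRECT;
}

/* (2) Driver side: xdp_do_redirect() reads it all back again */
static int sketch_do_redirect(struct net_device *dev, struct xdp_buff *xdp)
{
	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

	switch (ri->map_type) {
	case BPF_MAP_TYPE_XSKMAP:
		return __xsk_map_redirect(ri->tgt_value, xdp);
	default:
		return -EINVAL;
	}
}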

Thus, it seems (as you say) the bpf_redirect_info approach is too expensive. Maybe we should look at storing bpf_redirect_info in a place that doesn't require the this_cpu_ptr() lookup... or cache the lookup per NAPI cycle.
Have you tried this?


Also, centralised processing of XDP actions was investigated. This would involve porting all drivers
to a common interface for handling XDP actions, which would greatly simplify the work involved in
adding support for new XDP actions such as XDP_REDIRECT_XSK. However, at this point it was deemed
more complex than adding support for the new action to every driver. Should this series be
considered worth pursuing as a proper patch set, the intention would be to update each driver
individually.

I'm fine with adding a new helper, but I don't like introducing a new XDP_REDIRECT_XSK action, which requires updating ALL the drivers.

With the XDP_REDIRECT infra we believed we didn't need to add more XDP-action code to drivers, as we multiplex/add new features by extending bpf_redirect_info. In this extreme performance case, it seems the this_cpu_ptr "lookup" of bpf_redirect_info is the performance issue itself.

Could you experiment with different approaches that modify xdp_do_redirect() to handle the case where the new helper bpf_redirect_xsk was called, prior to the this_cpu_ptr() call?
(Thus avoiding the introduction of a new XDP action.)
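
Something along these lines, purely as a sketch of the idea; the _rx[].xsk field, the
redirect_xsk_only flag and xdp_do_redirect_generic() below are made-up names for illustration, and
how the "new helper was called" signal is carried cheaply is exactly the open question:

int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
		    struct bpf_prog *xdp_prog)
{
	struct xdp_sock *xs;

	/* Fast path: exactly one xsk bound to this Rx queue (the per-queue
	 * pointer added in patch 1) and the program asked for it via
	 * bpf_redirect_xsk(); no this_cpu_ptr(&bpf_redirect_info) needed.
	 */
	xs = READ_ONCE(dev->_rx[xdp->rxq->queue_index].xsk);
	if (xs && xdp_prog->redirect_xsk_only)	/* hypothetical flag */
		return __xsk_map_redirect(xs, xdp);

	/* Slow path: the existing generic handling via bpf_redirect_info. */
	return xdp_do_redirect_generic(dev, xdp, xdp_prog);
}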

Thank you to Magnus Karlsson and Maciej Fijalkowski for their many suggestions and insights.

TODO:
* Add selftest(s)
* Add support for all copy and zero copy drivers
* Libxdp support

The series applies on commit e5043894b21f ("bpftool: Use libbpf_get_error() to check error")

Thanks,
Ciara

Ciara Loftus (8):
   xsk: add struct xdp_sock to netdev_rx_queue
   bpf: add bpf_redirect_xsk helper and XDP_REDIRECT_XSK action
   xsk: handle XDP_REDIRECT_XSK and expose xsk_rcv/flush
   i40e: handle the XDP_REDIRECT_XSK action
   xsk: implement a batched version of xsk_rcv
   i40e: isolate descriptor processing in separate function
   i40e: introduce batched XDP rx descriptor processing
   libbpf: use bpf_redirect_xsk in the default program

  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  13 +-
  .../ethernet/intel/i40e/i40e_txrx_common.h    |   1 +
  drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 285 +++++++++++++++---
  include/linux/netdevice.h                     |   2 +
  include/net/xdp_sock_drv.h                    |  49 +++
  include/net/xsk_buff_pool.h                   |  22 ++
  include/uapi/linux/bpf.h                      |  13 +
  kernel/bpf/verifier.c                         |   7 +-
  net/core/dev.c                                |  14 +
  net/core/filter.c                             |  26 ++
  net/xdp/xsk.c                                 |  69 ++++-
  net/xdp/xsk_queue.h                           |  31 ++
  tools/include/uapi/linux/bpf.h                |  13 +
  tools/lib/bpf/xsk.c                           |  50 ++-
  14 files changed, 551 insertions(+), 44 deletions(-)




