Hey folks,

we have a node with 4 dual-port MT28908 Mellanox cards installed. We want to redirect the incoming traffic from an ingress port to an egress port on the same NIC using the bpf_redirect() helper function. Each ingress port receives roughly 25 Gbit/s of traffic with a packet size of about 1500 bytes. It is almost exclusively UDP multicast traffic: four multicast groups per bridge, coming from 64 different source IPs.

I raised the rx ring buffer size of the ingress ports to 8192 and set up interrupt coalescing:

ethtool -C ingress_port_i rx-frames 512
ethtool -C ingress_port_i rx-usecs 16

I still need to check whether these numbers make sense. The CPU utilization is moderate, around 20-30%.

On top of that we would like to record the traffic streams using the bpf_ringbuf_output() helper. For now I only write into the ring buffer to make the data available in user space. I have 16 ring buffers, one per CPU. I am currently not sure how to enforce that a ring buffer sits on the right NUMA node. Is there a way? numastat tells me that I have zero NUMA misses, so that is possibly OK.

In user space I start 16 threads, pinned to the cores, each running a handler that processes the content of its ring buffer. Currently I only count packets. With this in place all cores are fully utilized and I lose packets.

perf top tells me (only CPU 8):

   PerfTop:    3868 irqs/sec  kernel:95.9%  exact: 97.6%  lost: 0/0  drop: 0/0 [4000Hz cycles],  (all, CPU: 8)
---------------------------------------------------------------------------------------------------

    25.72%  11804  [kernel]  [k] xdp_do_redirect
     8.73%   4008  [kernel]  [k] memcpy
     7.19%   3297  [kernel]  [k] bq_enqueue
     7.16%   3287  [kernel]  [k] check_preemption_disabled
     4.28%   1959  [kernel]  [k] bpf_ringbuf_output
     4.20%   1925  [kernel]  [k] mlx5e_xdp_handle

and in perf record I indeed find the memcpy issue plus some load for mlx5e_napi_poll:

    --45.35%--__napi_poll
              mlx5e_napi_poll
              |
              |--36.80%--mlx5e_poll_rx_cq
              |          |
              |          |--35.64%--mlx5e_handle_rx_cqe_mpwrq
              |          |          |
              |          |           --35.56%--mlx5e_skb_from_cqe_mpwrq_linear
              |          |                     |
              |          |                      --34.57%--mlx5e_xdp_handle
              |          |                                |
              |          |                                |--32.15%--bpf_prog_82775e2abf7feec0_xdp_tap_ingress_prog
              |          |                                |          |
              |          |                                |           --31.20%--bpf_ringbuf_output
              |          |                                |                     |
              |          |                                |                      --30.69%--memcpy
              |          |                                |
              |          |                                 --0.77%--asm_common_interrupt
              |          |                                           common_interrupt
              |          |                                           |
              |          |                                            --0.72%--__common_interrupt
              |          |                                                      handle_edge_irq

Is there any chance to improve the ring buffer output? Or could I get the packets onto disk in some other way using BPF helper functions? Do I need to gather more or other information?

Thank you for your help.

Cheers,
Henning
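
P.S. In case it helps, here is a stripped-down sketch of the kernel side. It is not the exact production code: the map and constant names (tap_rbs, EGRESS_IFINDEX, SAMPLE_LEN) are illustrative, the real program copies the full frame rather than a fixed-size sample, and the 16 inner ring buffers are created and inserted from user space.

/* SPDX-License-Identifier: GPL-2.0 */
/* Simplified sketch of the ingress program: redirect every frame to the
 * paired egress port and copy a sample of it into the ring buffer that
 * belongs to the current CPU.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define NUM_CPUS       16
#define EGRESS_IFINDEX 5        /* placeholder ifindex of the egress port */
#define SAMPLE_LEN     256      /* fixed copy size keeps this sketch verifier-friendly */

struct ringbuf_map {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 24);   /* 16 MiB per CPU */
};

/* outer array: one inner ring buffer per CPU, populated from user space */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, NUM_CPUS);
	__type(key, int);
	__array(values, struct ringbuf_map);
} tap_rbs SEC(".maps");

SEC("xdp")
int xdp_tap_ingress_prog(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	int cpu = bpf_get_smp_processor_id();
	void *rb;

	/* copy (a sample of) the frame into this CPU's ring buffer */
	rb = bpf_map_lookup_elem(&tap_rbs, &cpu);
	if (rb && data + SAMPLE_LEN <= data_end)
		bpf_ringbuf_output(rb, data, SAMPLE_LEN, 0);

	/* forward the frame to the egress port on the same NIC */
	return bpf_redirect(EGRESS_IFINDEX, 0);
}

char LICENSE[] SEC("license") = "GPL";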
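
And a correspondingly simplified sketch of the user-space side: one consumer thread per CPU, pinned to that CPU, which for now only counts packets. Loading and attaching the XDP object, creating the 16 inner ring buffers and populating tap_rbs are omitted; outer_map_fd stands in for the fd of tap_rbs, and error handling is stripped.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#define NUM_CPUS 16

static int outer_map_fd;                    /* fd of tap_rbs, set after load */
static unsigned long pkt_count[NUM_CPUS];   /* one counter per consumer thread */

static int handle_sample(void *ctx, void *data, size_t len)
{
	long cpu = (long)ctx;

	pkt_count[cpu]++;                   /* currently we only count packets */
	return 0;
}

static void *consumer(void *arg)
{
	long cpu = (long)arg;
	__u32 key = cpu, inner_id = 0;
	struct ring_buffer *rb;
	cpu_set_t set;
	int inner_fd;

	/* pin this thread to the CPU whose ring buffer it drains */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	/* the outer lookup returns the inner map id; turn it into an fd */
	bpf_map_lookup_elem(outer_map_fd, &key, &inner_id);
	inner_fd = bpf_map_get_fd_by_id(inner_id);

	rb = ring_buffer__new(inner_fd, handle_sample, (void *)cpu, NULL);
	while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
		;

	ring_buffer__free(rb);
	return NULL;
}

int main(void)
{
	pthread_t tid[NUM_CPUS];
	long cpu;

	/* ... open/load/attach the XDP object, create the inner ring
	 * buffers, populate tap_rbs, set outer_map_fd ... */

	for (cpu = 0; cpu < NUM_CPUS; cpu++)
		pthread_create(&tid[cpu], NULL, consumer, (void *)cpu);
	for (cpu = 0; cpu < NUM_CPUS; cpu++)
		pthread_join(tid[cpu], NULL);

	for (cpu = 0; cpu < NUM_CPUS; cpu++)
		printf("cpu %ld: %lu packets\n", cpu, pkt_count[cpu]);
	return 0;
}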