> > The flow-control was not disabled before, so according to your
> > suggestion, I disabled the flow-control on both boards and ran the
> > test again. The performance is slightly improved, but I still cannot
> > see a clear difference between the two methods. Below are the results.
>
> Something else must be stalling the CPU.
> When looking at the fec_main.c code, I noticed that
> fec_enet_txq_xmit_frame() will do an MMIO write for every xdp_frame (to
> trigger transmit start), which I believe will stall the CPU.
> The ndo_xdp_xmit/fec_enet_xdp_xmit does bulking, and should be the
> function that does the MMIO write to trigger transmit start.

We'd better keep an MMIO write for every xdp_frame on the txq. As you know,
the txq becomes inactive when no additional ready descriptors remain in the
tx-BDR, so doing one MMIO write for multiple packets may increase the
latency of the packets.

> $ git diff
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 03ac7690b5c4..57a6a3899b80 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -3849,9 +3849,6 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>
>         txq->bd.cur = bdp;
>
> -       /* Trigger transmission start */
> -       writel(0, txq->bd.reg_desc_active);
> -
>         return 0;
>  }
>
> @@ -3880,6 +3877,9 @@ static int fec_enet_xdp_xmit(struct net_device *dev,
>                 sent_frames++;
>         }
>
> +       /* Trigger transmission start */
> +       writel(0, txq->bd.reg_desc_active);
> +
>         __netif_tx_unlock(nq);
>
>         return sent_frames;

> > Result: use "sync_dma_len" method
> > root@imx8mpevk:~# ./xdp2 eth0
>
> The xdp2 (and xdp1) program(s) have a performance issue (due to using
>
> Can I ask you to test using xdp_rxq_info, like:
>
>   sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_TX

Yes, below are the results; the results are also basically the same.

Result 1: current method
./xdp_rxq_info --dev eth0 --action XDP_TX

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,102     0
XDP-RX CPU      total   259,102

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,102     0
rx_queue_index    0:sum 259,102

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,498     0
XDP-RX CPU      total   259,498

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,496     0
rx_queue_index    0:sum 259,496

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,408     0
XDP-RX CPU      total   259,408

Result 2: dma_sync_len method

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       258,254     0
XDP-RX CPU      total   258,254

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   258,254     0
rx_queue_index    0:sum 258,254

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,316     0
XDP-RX CPU      total   259,316

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,318     0
rx_queue_index    0:sum 259,318

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,554     0
XDP-RX CPU      total   259,554

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,553     0
rx_queue_index    0:sum 259,553

> > proto 17: 258886 pkt/s
> > proto 17: 258879 pkt/s
>
> If you provide numbers for xdp_redirect, then we could better evaluate if
> changing the lock per xdp_frame, for XDP_TX also, is worth it.
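Before getting to the xdp_redirect numbers: for reference, the bulked-doorbell
change in the diff above boils down to roughly the shape below. This is only a
simplified sketch reconstructed from the hunks shown, not the exact fec_main.c
code; locals and error handling are abbreviated, and fec_enet_xdp_get_tx_queue()
is assumed to pick the per-CPU tx queue.

    /* Simplified sketch (not the actual driver code): fill a tx-BDR entry
     * for every xdp_frame in the bulk, then do a single MMIO doorbell write.
     */
    static int fec_enet_xdp_xmit(struct net_device *dev, int num_frames,
                                 struct xdp_frame **frames, u32 flags)
    {
            struct fec_enet_private *fep = netdev_priv(dev);
            struct fec_enet_priv_tx_q *txq;
            struct netdev_queue *nq;
            int cpu = smp_processor_id();
            int sent_frames = 0;
            int queue, i;

            queue = fec_enet_xdp_get_tx_queue(fep, cpu);
            txq = fep->tx_queue[queue];
            nq = netdev_get_tx_queue(fep->netdev, queue);

            __netif_tx_lock(nq, cpu);

            for (i = 0; i < num_frames; i++) {
                    /* Fills a tx-BDR entry; with the diff above it no
                     * longer writes the doorbell itself.
                     */
                    if (fec_enet_txq_xmit_frame(fep, txq, frames[i]) < 0)
                            break;
                    sent_frames++;
            }

            /* One doorbell for the whole bulk: trigger transmission start */
            writel(0, txq->bd.reg_desc_active);

            __netif_tx_unlock(nq);

            return sent_frames;
    }

The trade-off raised above is that, with one doorbell per bulk, the first
frame of a burst is not handed to the MAC until the loop finishes, whereas a
per-frame writel() restarts an inactive txq as early as possible at the cost
of one MMIO write per frame.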
For XDP_REDIRECT, the performance is shown below.

root@imx8mpevk:~# ./xdp_redirect eth1 eth0
Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
eth1->eth0      221,642 rx/s    0 err,drop/s    221,643 xmit/s
eth1->eth0      221,761 rx/s    0 err,drop/s    221,760 xmit/s
eth1->eth0      221,793 rx/s    0 err,drop/s    221,794 xmit/s
eth1->eth0      221,825 rx/s    0 err,drop/s    221,825 xmit/s
eth1->eth0      221,823 rx/s    0 err,drop/s    221,821 xmit/s
eth1->eth0      221,815 rx/s    0 err,drop/s    221,816 xmit/s
eth1->eth0      222,016 rx/s    0 err,drop/s    222,016 xmit/s
eth1->eth0      222,059 rx/s    0 err,drop/s    222,059 xmit/s
eth1->eth0      222,085 rx/s    0 err,drop/s    222,089 xmit/s
eth1->eth0      221,956 rx/s    0 err,drop/s    221,952 xmit/s
eth1->eth0      222,070 rx/s    0 err,drop/s    222,071 xmit/s
eth1->eth0      222,017 rx/s    0 err,drop/s    222,017 xmit/s
eth1->eth0      222,069 rx/s    0 err,drop/s    222,067 xmit/s
eth1->eth0      221,986 rx/s    0 err,drop/s    221,987 xmit/s
eth1->eth0      221,932 rx/s    0 err,drop/s    221,936 xmit/s
eth1->eth0      222,045 rx/s    0 err,drop/s    222,041 xmit/s
eth1->eth0      222,014 rx/s    0 err,drop/s    222,014 xmit/s
  Packets received    : 3,772,908
  Average packets/s   : 221,936
  Packets transmitted : 3,772,908
  Average transmit/s  : 221,936

> And also find out if moving the MMIO write has any effect.

I moved the MMIO write to fec_enet_xdp_xmit(); the results are shown below.
The performance is slightly improved.

root@imx8mpevk:~# ./xdp_redirect eth1 eth0
Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
eth1->eth0      222,666 rx/s    0 err,drop/s    222,668 xmit/s
eth1->eth0      221,663 rx/s    0 err,drop/s    221,664 xmit/s
eth1->eth0      222,743 rx/s    0 err,drop/s    222,741 xmit/s
eth1->eth0      222,917 rx/s    0 err,drop/s    222,923 xmit/s
eth1->eth0      221,810 rx/s    0 err,drop/s    221,808 xmit/s
eth1->eth0      222,891 rx/s    0 err,drop/s    222,888 xmit/s
eth1->eth0      222,983 rx/s    0 err,drop/s    222,984 xmit/s
eth1->eth0      221,655 rx/s    0 err,drop/s    221,653 xmit/s
eth1->eth0      222,827 rx/s    0 err,drop/s    222,827 xmit/s
eth1->eth0      221,728 rx/s    0 err,drop/s    221,728 xmit/s
eth1->eth0      222,790 rx/s    0 err,drop/s    222,789 xmit/s
eth1->eth0      222,874 rx/s    0 err,drop/s    222,874 xmit/s
eth1->eth0      221,888 rx/s    0 err,drop/s    221,887 xmit/s
eth1->eth0      223,057 rx/s    0 err,drop/s    223,056 xmit/s
eth1->eth0      222,219 rx/s    0 err,drop/s    222,220 xmit/s
  Packets received    : 3,336,711
  Average packets/s   : 222,447
  Packets transmitted : 3,336,710
  Average transmit/s  : 222,447

> I also noticed the driver does an MMIO write (on rxq) for every RX packet
> in the fec_enet_rx_queue() napi-poll loop. This also looks like a
> potential performance stall.

The same as the txq, the rxq will become inactive if the rx-BDR has no free
BDs, so we'd better do an MMIO write each time we recycle a BD, so that the
hardware can promptly place received packets on the rx-BDR.

In addition, I also tried to avoid using xdp_convert_buff_to_frame(), but
the XDP_TX performance still did not improve. :(

After these days of testing, I think it is best to keep the solution from V3
and then make some optimizations on top of the V3 patch.
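To make the rx-side question concrete, the two doorbell placements under
discussion look roughly like the sketch below. The helpers fec_rx_desc_ready()
and fec_rx_process_one() are invented placeholders for illustration, and the
function is a heavily abbreviated stand-in for fec_enet_rx_queue(), not the
real driver code.

    /* Illustration only: hypothetical helpers, heavily abbreviated rx loop. */
    static int fec_rx_poll_sketch(struct fec_enet_private *fep,
                                  struct fec_enet_priv_rx_q *rxq, int budget)
    {
            int pkt_received = 0;

            while (pkt_received < budget && fec_rx_desc_ready(rxq)) {
                    /* run the XDP program, recycle/refill the buffer desc */
                    fec_rx_process_one(fep, rxq);
                    pkt_received++;

                    /* Per-BD doorbell (current behaviour): the recycled BD
                     * is handed back to the MAC immediately, so the rxq
                     * does not sit inactive while the rest of the burst is
                     * still being processed.
                     */
                    writel(0, rxq->bd.reg_desc_active);
            }

            /* Batched alternative raised in the review: drop the writel()
             * above and do a single write here instead, saving MMIO stalls
             * at the risk of the rx-BDR running out of free BDs mid-burst:
             *
             *      writel(0, rxq->bd.reg_desc_active);
             */

            return pkt_received;
    }

As with the txq doorbell, the choice trades the cost of an MMIO write per
descriptor against how quickly the hardware sees recycled BDs, which is
exactly the concern described above.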