> > The flow-control was not disabled before, so according to your
> > suggestion, I disabled the flow-control on both boards and ran the
> > test again. The performance is slightly improved, but I still cannot
> > see a clear difference between the two methods. Below are the results.
>
> Something else must be stalling the CPU.
> When looking at the fec_main.c code, I noticed that
> fec_enet_txq_xmit_frame() will do an MMIO write for every xdp_frame (to
> trigger transmit start), which I believe will stall the CPU.
> The ndo_xdp_xmit/fec_enet_xdp_xmit does bulking, and should be the
> function that does the MMIO write to trigger transmit start.

We'd better keep an MMIO write for every xdp_frame on the txq. As you know,
the txq becomes inactive when no additional ready descriptors remain in the
tx-BDR, so doing one MMIO write for multiple packets may increase the
latency of the packets.

> $ git diff
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 03ac7690b5c4..57a6a3899b80 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -3849,9 +3849,6 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>
>         txq->bd.cur = bdp;
>
> -       /* Trigger transmission start */
> -       writel(0, txq->bd.reg_desc_active);
> -
>         return 0;
>  }
>
> @@ -3880,6 +3877,9 @@ static int fec_enet_xdp_xmit(struct net_device *dev,
>                 sent_frames++;
>         }
>
> +       /* Trigger transmission start */
> +       writel(0, txq->bd.reg_desc_active);
> +
>         __netif_tx_unlock(nq);
>
>         return sent_frames;

> > Result: use "sync_dma_len" method
> > root@imx8mpevk:~# ./xdp2 eth0
>
> The xdp2 (and xdp1) program(s) have a performance issue (due to using
>
> Can I ask you to test using xdp_rxq_info, like:
>
>   sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_TX

Yes, below are the results; the results are also basically the same.

Result 1: current method
./xdp_rxq_info --dev eth0 --action XDP_TX

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,102     0
XDP-RX CPU      total   259,102

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,102     0
rx_queue_index    0:sum 259,102

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,498     0
XDP-RX CPU      total   259,498

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,496     0
rx_queue_index    0:sum 259,496

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,408     0
XDP-RX CPU      total   259,408

Result 2: dma_sync_len method

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       258,254     0
XDP-RX CPU      total   258,254

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   258,254     0
rx_queue_index    0:sum 258,254

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,316     0
XDP-RX CPU      total   259,316

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,318     0
rx_queue_index    0:sum 259,318

Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,554     0
XDP-RX CPU      total   259,554

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,553     0
rx_queue_index    0:sum 259,553

> > proto 17: 258886 pkt/s
> > proto 17: 258879 pkt/s
>
> If you provide numbers for xdp_redirect, then we could better evaluate if
> changing the lock per xdp_frame, for XDP_TX also, is worth it.
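Before getting to the xdp_redirect numbers: for reference, the bulked-doorbell
change in the diff above boils down to roughly the shape below. This is only a
simplified sketch reconstructed from the hunks shown, not the exact fec_main.c
code; locals and error handling are abbreviated, and fec_enet_xdp_get_tx_queue()
is assumed to pick the per-CPU tx queue.

    /* Simplified sketch (not the actual driver code): fill a tx-BDR entry
     * for every xdp_frame in the bulk, then do a single MMIO doorbell write.
     */
    static int fec_enet_xdp_xmit(struct net_device *dev, int num_frames,
                                 struct xdp_frame **frames, u32 flags)
    {
            struct fec_enet_private *fep = netdev_priv(dev);
            struct fec_enet_priv_tx_q *txq;
            struct netdev_queue *nq;
            int cpu = smp_processor_id();
            int sent_frames = 0;
            int queue, i;

            queue = fec_enet_xdp_get_tx_queue(fep, cpu);
            txq = fep->tx_queue[queue];
            nq = netdev_get_tx_queue(fep->netdev, queue);

            __netif_tx_lock(nq, cpu);

            for (i = 0; i < num_frames; i++) {
                    /* Fills a tx-BDR entry; with the diff above it no
                     * longer writes the doorbell itself.
                     */
                    if (fec_enet_txq_xmit_frame(fep, txq, frames[i]) < 0)
                            break;
                    sent_frames++;
            }

            /* One doorbell for the whole bulk: trigger transmission start */
            writel(0, txq->bd.reg_desc_active);

            __netif_tx_unlock(nq);

            return sent_frames;
    }

The trade-off raised above is that, with one doorbell per bulk, the first
frame of a burst is not handed to the MAC until the loop finishes, whereas a
per-frame writel() restarts an inactive txq as early as possible at the cost
of one MMIO write per frame.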
For XDP_REDIRECT, the performance is shown below.

root@imx8mpevk:~# ./xdp_redirect eth1 eth0
Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
eth1->eth0      221,642 rx/s    0 err,drop/s    221,643 xmit/s
eth1->eth0      221,761 rx/s    0 err,drop/s    221,760 xmit/s
eth1->eth0      221,793 rx/s    0 err,drop/s    221,794 xmit/s
eth1->eth0      221,825 rx/s    0 err,drop/s    221,825 xmit/s
eth1->eth0      221,823 rx/s    0 err,drop/s    221,821 xmit/s
eth1->eth0      221,815 rx/s    0 err,drop/s    221,816 xmit/s
eth1->eth0      222,016 rx/s    0 err,drop/s    222,016 xmit/s
eth1->eth0      222,059 rx/s    0 err,drop/s    222,059 xmit/s
eth1->eth0      222,085 rx/s    0 err,drop/s    222,089 xmit/s
eth1->eth0      221,956 rx/s    0 err,drop/s    221,952 xmit/s
eth1->eth0      222,070 rx/s    0 err,drop/s    222,071 xmit/s
eth1->eth0      222,017 rx/s    0 err,drop/s    222,017 xmit/s
eth1->eth0      222,069 rx/s    0 err,drop/s    222,067 xmit/s
eth1->eth0      221,986 rx/s    0 err,drop/s    221,987 xmit/s
eth1->eth0      221,932 rx/s    0 err,drop/s    221,936 xmit/s
eth1->eth0      222,045 rx/s    0 err,drop/s    222,041 xmit/s
eth1->eth0      222,014 rx/s    0 err,drop/s    222,014 xmit/s
  Packets received    : 3,772,908
  Average packets/s   : 221,936
  Packets transmitted : 3,772,908
  Average transmit/s  : 221,936

> And also find out if moving the MMIO write has any effect.

I moved the MMIO write to fec_enet_xdp_xmit(); the results are shown below.
The performance is slightly improved.

root@imx8mpevk:~# ./xdp_redirect eth1 eth0
Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
eth1->eth0      222,666 rx/s    0 err,drop/s    222,668 xmit/s
eth1->eth0      221,663 rx/s    0 err,drop/s    221,664 xmit/s
eth1->eth0      222,743 rx/s    0 err,drop/s    222,741 xmit/s
eth1->eth0      222,917 rx/s    0 err,drop/s    222,923 xmit/s
eth1->eth0      221,810 rx/s    0 err,drop/s    221,808 xmit/s
eth1->eth0      222,891 rx/s    0 err,drop/s    222,888 xmit/s
eth1->eth0      222,983 rx/s    0 err,drop/s    222,984 xmit/s
eth1->eth0      221,655 rx/s    0 err,drop/s    221,653 xmit/s
eth1->eth0      222,827 rx/s    0 err,drop/s    222,827 xmit/s
eth1->eth0      221,728 rx/s    0 err,drop/s    221,728 xmit/s
eth1->eth0      222,790 rx/s    0 err,drop/s    222,789 xmit/s
eth1->eth0      222,874 rx/s    0 err,drop/s    222,874 xmit/s
eth1->eth0      221,888 rx/s    0 err,drop/s    221,887 xmit/s
eth1->eth0      223,057 rx/s    0 err,drop/s    223,056 xmit/s
eth1->eth0      222,219 rx/s    0 err,drop/s    222,220 xmit/s
  Packets received    : 3,336,711
  Average packets/s   : 222,447
  Packets transmitted : 3,336,710
  Average transmit/s  : 222,447

> I also noticed the driver does an MMIO write (on rxq) for every RX packet
> in the fec_enet_rx_queue() napi-poll loop. This also looks like a
> potential performance stall.

The same as the txq, the rxq will become inactive if the rx-BDR has no free
BDs, so we'd better do an MMIO write each time we recycle a BD, so that the
hardware can promptly place received packets on the rx-BDR.

In addition, I also tried to avoid using xdp_convert_buff_to_frame(), but
the XDP_TX performance still did not improve. :(

After these days of testing, I think it is best to keep the solution from V3
and then make some optimizations on top of the V3 patch.
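To make the rx-side question concrete, the two doorbell placements under
discussion look roughly like the sketch below. The helpers fec_rx_desc_ready()
and fec_rx_process_one() are invented placeholders for illustration, and the
function is a heavily abbreviated stand-in for fec_enet_rx_queue(), not the
real driver code.

    /* Illustration only: hypothetical helpers, heavily abbreviated rx loop. */
    static int fec_rx_poll_sketch(struct fec_enet_private *fep,
                                  struct fec_enet_priv_rx_q *rxq, int budget)
    {
            int pkt_received = 0;

            while (pkt_received < budget && fec_rx_desc_ready(rxq)) {
                    /* run the XDP program, recycle/refill the buffer desc */
                    fec_rx_process_one(fep, rxq);
                    pkt_received++;

                    /* Per-BD doorbell (current behaviour): the recycled BD
                     * is handed back to the MAC immediately, so the rxq
                     * does not sit inactive while the rest of the burst is
                     * still being processed.
                     */
                    writel(0, rxq->bd.reg_desc_active);
            }

            /* Batched alternative raised in the review: drop the writel()
             * above and do a single write here instead, saving MMIO stalls
             * at the risk of the rx-BDR running out of free BDs mid-burst:
             *
             *      writel(0, rxq->bd.reg_desc_active);
             */

            return pkt_received;
    }

As with the txq doorbell, the choice trades the cost of an MMIO write per
descriptor against how quickly the hardware sees recycled BDs, which is
exactly the concern described above.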