Re: [PATCH V3 net-next] net: fec: add XDP_TX feature support

Jesper Dangaard Brouer <hawk@xxxxxxxxxx> · Mon, 7 Aug 2023 15:15:22 +0200

On 07/08/2023 12.30, Wei Fang wrote:
The flow-control was not disabled before, so according to your
suggestion, I disable the flow-control on the both boards and run the
test again, the performance is slightly improved, but still can not
see a clear difference between the two methods. Below are the results.
Something else must be stalling the CPU.
When looking at fec_main.c code, I noticed that
fec_enet_txq_xmit_frame() will do a MMIO write for every xdp_frame (to
trigger transmit start), which I believe will stall the CPU.
The ndo_xdp_xmit/fec_enet_xdp_xmit does bulking, and should be the
function that does the MMIO write to trigger transmit start.

We'd better keep a MMIO write for every xdp_frame on txq, as you know,
the txq will be inactive when no additional ready descriptors remain in the
tx-BDR. So it may increase the delay of the packets if we do a MMIO write
for multiple packets.

You know this hardware better than me, so I will leave to you.

$ git diff

diff --git a/drivers/net/ethernet/freescale/fec_main.c
b/drivers/net/ethernet/freescale/fec_main.c
index 03ac7690b5c4..57a6a3899b80 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3849,9 +3849,6 @@ static int fec_enet_txq_xmit_frame(struct
fec_enet_private *fep,

          txq->bd.cur = bdp;

-       /* Trigger transmission start */
-       writel(0, txq->bd.reg_desc_active);
-
          return 0;
   }

@@ -3880,6 +3877,9 @@ static int fec_enet_xdp_xmit(struct net_device
*dev,
                  sent_frames++;
          }

+       /* Trigger transmission start */
+       writel(0, txq->bd.reg_desc_active);
+
          __netif_tx_unlock(nq);

          return sent_frames;


Result: use "sync_dma_len" method
root@imx8mpevk:~# ./xdp2 eth0
The xdp2 (and xdp1) program(s) have a performance issue (due to using

Can I ask you to test using xdp_rxq_info, like:

   sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_TX

Yes, below are the results, the results are also basically the same.
Result 1: current method
./xdp_rxq_info --dev eth0 --action XDP_TX
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,102     0
XDP-RX CPU      total   259,102
RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,102     0
rx_queue_index    0:sum 259,102
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,498     0
XDP-RX CPU      total   259,498
RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,496     0
rx_queue_index    0:sum 259,496
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,408     0
XDP-RX CPU      total   259,408

Result 2: dma_sync_len method
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       258,254     0
XDP-RX CPU      total   258,254
RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   258,254     0
rx_queue_index    0:sum 258,254
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,316     0
XDP-RX CPU      total   259,316
RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,318     0
rx_queue_index    0:sum 259,318
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,554     0
XDP-RX CPU      total   259,554
RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,553     0
rx_queue_index    0:sum 259,553

Thanks for running this.

proto 17:     258886 pkt/s
proto 17:     258879 pkt/s
If you provide numbers for xdp_redirect, then we could better evaluate if
changing the lock per xdp_frame, for XDP_TX also, is worth it.

For XDP_REDIRECT, the performance show as follow.
root@imx8mpevk:~# ./xdp_redirect eth1 eth0
Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
This is not exactly the same as XDP_TX setup as here you choose to 
redirect between eth1 (driver st_gmac) and to eth0 (driver fec).
I would like to see eth0 to eth0 XDP_REDIRECT, so we can compare to 
XDP_TX performance.
Sorry for all the requests, but can you provide those numbers?

eth1->eth0        221,642 rx/s       0 err,drop/s      221,643 xmit/s
So, XDP_REDIRECT is approx (1-(221825/259554))*100 = 14.53% slower.
But as this is 'eth1->eth0' this isn't true comparison to XDP_TX.

eth1->eth0        221,761 rx/s       0 err,drop/s      221,760 xmit/s
eth1->eth0        221,793 rx/s       0 err,drop/s      221,794 xmit/s
eth1->eth0        221,825 rx/s       0 err,drop/s      221,825 xmit/s
eth1->eth0        221,823 rx/s       0 err,drop/s      221,821 xmit/s
eth1->eth0        221,815 rx/s       0 err,drop/s      221,816 xmit/s
eth1->eth0        222,016 rx/s       0 err,drop/s      222,016 xmit/s
eth1->eth0        222,059 rx/s       0 err,drop/s      222,059 xmit/s
eth1->eth0        222,085 rx/s       0 err,drop/s      222,089 xmit/s
eth1->eth0        221,956 rx/s       0 err,drop/s      221,952 xmit/s
eth1->eth0        222,070 rx/s       0 err,drop/s      222,071 xmit/s
eth1->eth0        222,017 rx/s       0 err,drop/s      222,017 xmit/s
eth1->eth0        222,069 rx/s       0 err,drop/s      222,067 xmit/s
eth1->eth0        221,986 rx/s       0 err,drop/s      221,987 xmit/s
eth1->eth0        221,932 rx/s       0 err,drop/s      221,936 xmit/s
eth1->eth0        222,045 rx/s       0 err,drop/s      222,041 xmit/s
eth1->eth0        222,014 rx/s       0 err,drop/s      222,014 xmit/s
   Packets received    : 3,772,908
   Average packets/s   : 221,936
   Packets transmitted : 3,772,908
   Average transmit/s  : 221,936
And also find out of moving the MMIO write have any effect.

I move the MMIO write to fec_enet_xdp_xmit(), the result shows as follow,
the performance is slightly improved.

I'm puzzled that moving the MMIO write isn't change performance.

Can you please verify that the packet generator machine is sending more
frame than the system can handle?

(meaning the pktgen_sample03_burst_single_flow.sh script fast enough?)

root@imx8mpevk:~# ./xdp_redirect eth1 eth0
Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
eth1->eth0        222,666 rx/s        0 err,drop/s      222,668 xmit/s
eth1->eth0        221,663 rx/s        0 err,drop/s      221,664 xmit/s
eth1->eth0        222,743 rx/s        0 err,drop/s      222,741 xmit/s
eth1->eth0        222,917 rx/s        0 err,drop/s      222,923 xmit/s
eth1->eth0        221,810 rx/s        0 err,drop/s      221,808 xmit/s
eth1->eth0        222,891 rx/s        0 err,drop/s      222,888 xmit/s
eth1->eth0        222,983 rx/s        0 err,drop/s      222,984 xmit/s
eth1->eth0        221,655 rx/s        0 err,drop/s      221,653 xmit/s
eth1->eth0        222,827 rx/s        0 err,drop/s      222,827 xmit/s
eth1->eth0        221,728 rx/s        0 err,drop/s      221,728 xmit/s
eth1->eth0        222,790 rx/s        0 err,drop/s      222,789 xmit/s
eth1->eth0        222,874 rx/s        0 err,drop/s      222,874 xmit/s
eth1->eth0        221,888 rx/s        0 err,drop/s      221,887 xmit/s
eth1->eth0        223,057 rx/s        0 err,drop/s      223,056 xmit/s
eth1->eth0        222,219 rx/s        0 err,drop/s      222,220 xmit/s
   Packets received    : 3,336,711
   Average packets/s   : 222,447
   Packets transmitted : 3,336,710
   Average transmit/s  : 222,447

I also noticed driver does a MMIO write (on rxq) for every RX-packet in
fec_enet_rx_queue() napi-poll loop.  This also looks like a potential
performance stall.

The same as txq, the rxq will be inactive if the rx-BDR has no free BDs, so we'd
better do a MMIO write when we recycle a BD, so that the hardware can timely
attach the received pakcets on the rx-BDR.

In addition, I also tried to avoid using xdp_convert_buff_to_frame(), but the
performance of XDP_TX is still not improved. :(

I would not expect much performance improvement from this anyhow.

After these days of testing, I think it's best to keep the solution in V3, and then
make some optimizations on the V3 patch.
I agree.

I think you need to send a V5, and then I can ACK that.

Thanks for all this testing,
--Jesper