On 07/08/2023 12.30, Wei Fang wrote:
The flow-control was not disabled before, so according to your
suggestion, I disable the flow-control on the both boards and run the
test again, the performance is slightly improved, but still can not
see a clear difference between the two methods. Below are the results.
Something else must be stalling the CPU.
When looking at fec_main.c code, I noticed that
fec_enet_txq_xmit_frame() will do a MMIO write for every xdp_frame (to
trigger transmit start), which I believe will stall the CPU.
The ndo_xdp_xmit/fec_enet_xdp_xmit does bulking, and should be the
function that does the MMIO write to trigger transmit start.
We'd better keep a MMIO write for every xdp_frame on txq, as you know,
the txq will be inactive when no additional ready descriptors remain in the
tx-BDR. So it may increase the delay of the packets if we do a MMIO write
for multiple packets.
You know this hardware better than me, so I will leave to you.
$ git diff
diff --git a/drivers/net/ethernet/freescale/fec_main.c
b/drivers/net/ethernet/freescale/fec_main.c
index 03ac7690b5c4..57a6a3899b80 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3849,9 +3849,6 @@ static int fec_enet_txq_xmit_frame(struct
fec_enet_private *fep,
txq->bd.cur = bdp;
- /* Trigger transmission start */
- writel(0, txq->bd.reg_desc_active);
-
return 0;
}
@@ -3880,6 +3877,9 @@ static int fec_enet_xdp_xmit(struct net_device
*dev,
sent_frames++;
}
+ /* Trigger transmission start */
+ writel(0, txq->bd.reg_desc_active);
+
__netif_tx_unlock(nq);
return sent_frames;
Result: use "sync_dma_len" method
root@imx8mpevk:~# ./xdp2 eth0
The xdp2 (and xdp1) program(s) have a performance issue (due to using
Can I ask you to test using xdp_rxq_info, like:
sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_TX
Yes, below are the results, the results are also basically the same.
Result 1: current method
./xdp_rxq_info --dev eth0 --action XDP_TX
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats CPU pps issue-pps
XDP-RX CPU 0 259,102 0
XDP-RX CPU total 259,102
RXQ stats RXQ:CPU pps issue-pps
rx_queue_index 0:0 259,102 0
rx_queue_index 0:sum 259,102
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats CPU pps issue-pps
XDP-RX CPU 0 259,498 0
XDP-RX CPU total 259,498
RXQ stats RXQ:CPU pps issue-pps
rx_queue_index 0:0 259,496 0
rx_queue_index 0:sum 259,496
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats CPU pps issue-pps
XDP-RX CPU 0 259,408 0
XDP-RX CPU total 259,408
Result 2: dma_sync_len method
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats CPU pps issue-pps
XDP-RX CPU 0 258,254 0
XDP-RX CPU total 258,254
RXQ stats RXQ:CPU pps issue-pps
rx_queue_index 0:0 258,254 0
rx_queue_index 0:sum 258,254
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats CPU pps issue-pps
XDP-RX CPU 0 259,316 0
XDP-RX CPU total 259,316
RXQ stats RXQ:CPU pps issue-pps
rx_queue_index 0:0 259,318 0
rx_queue_index 0:sum 259,318
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats CPU pps issue-pps
XDP-RX CPU 0 259,554 0
XDP-RX CPU total 259,554
RXQ stats RXQ:CPU pps issue-pps
rx_queue_index 0:0 259,553 0
rx_queue_index 0:sum 259,553
Thanks for running this.
proto 17: 258886 pkt/s
proto 17: 258879 pkt/s
If you provide numbers for xdp_redirect, then we could better evaluate if
changing the lock per xdp_frame, for XDP_TX also, is worth it.
For XDP_REDIRECT, the performance show as follow.
root@imx8mpevk:~# ./xdp_redirect eth1 eth0
Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
This is not exactly the same as XDP_TX setup as here you choose to
redirect between eth1 (driver st_gmac) and to eth0 (driver fec).
I would like to see eth0 to eth0 XDP_REDIRECT, so we can compare to
XDP_TX performance.
Sorry for all the requests, but can you provide those numbers?
eth1->eth0 221,642 rx/s 0 err,drop/s 221,643 xmit/s
So, XDP_REDIRECT is approx (1-(221825/259554))*100 = 14.53% slower.
But as this is 'eth1->eth0' this isn't true comparison to XDP_TX.
eth1->eth0 221,761 rx/s 0 err,drop/s 221,760 xmit/s
eth1->eth0 221,793 rx/s 0 err,drop/s 221,794 xmit/s
eth1->eth0 221,825 rx/s 0 err,drop/s 221,825 xmit/s
eth1->eth0 221,823 rx/s 0 err,drop/s 221,821 xmit/s
eth1->eth0 221,815 rx/s 0 err,drop/s 221,816 xmit/s
eth1->eth0 222,016 rx/s 0 err,drop/s 222,016 xmit/s
eth1->eth0 222,059 rx/s 0 err,drop/s 222,059 xmit/s
eth1->eth0 222,085 rx/s 0 err,drop/s 222,089 xmit/s
eth1->eth0 221,956 rx/s 0 err,drop/s 221,952 xmit/s
eth1->eth0 222,070 rx/s 0 err,drop/s 222,071 xmit/s
eth1->eth0 222,017 rx/s 0 err,drop/s 222,017 xmit/s
eth1->eth0 222,069 rx/s 0 err,drop/s 222,067 xmit/s
eth1->eth0 221,986 rx/s 0 err,drop/s 221,987 xmit/s
eth1->eth0 221,932 rx/s 0 err,drop/s 221,936 xmit/s
eth1->eth0 222,045 rx/s 0 err,drop/s 222,041 xmit/s
eth1->eth0 222,014 rx/s 0 err,drop/s 222,014 xmit/s
Packets received : 3,772,908
Average packets/s : 221,936
Packets transmitted : 3,772,908
Average transmit/s : 221,936
And also find out of moving the MMIO write have any effect.
I move the MMIO write to fec_enet_xdp_xmit(), the result shows as follow,
the performance is slightly improved.
I'm puzzled that moving the MMIO write isn't change performance.
Can you please verify that the packet generator machine is sending more
frame than the system can handle?
(meaning the pktgen_sample03_burst_single_flow.sh script fast enough?)
root@imx8mpevk:~# ./xdp_redirect eth1 eth0
Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
eth1->eth0 222,666 rx/s 0 err,drop/s 222,668 xmit/s
eth1->eth0 221,663 rx/s 0 err,drop/s 221,664 xmit/s
eth1->eth0 222,743 rx/s 0 err,drop/s 222,741 xmit/s
eth1->eth0 222,917 rx/s 0 err,drop/s 222,923 xmit/s
eth1->eth0 221,810 rx/s 0 err,drop/s 221,808 xmit/s
eth1->eth0 222,891 rx/s 0 err,drop/s 222,888 xmit/s
eth1->eth0 222,983 rx/s 0 err,drop/s 222,984 xmit/s
eth1->eth0 221,655 rx/s 0 err,drop/s 221,653 xmit/s
eth1->eth0 222,827 rx/s 0 err,drop/s 222,827 xmit/s
eth1->eth0 221,728 rx/s 0 err,drop/s 221,728 xmit/s
eth1->eth0 222,790 rx/s 0 err,drop/s 222,789 xmit/s
eth1->eth0 222,874 rx/s 0 err,drop/s 222,874 xmit/s
eth1->eth0 221,888 rx/s 0 err,drop/s 221,887 xmit/s
eth1->eth0 223,057 rx/s 0 err,drop/s 223,056 xmit/s
eth1->eth0 222,219 rx/s 0 err,drop/s 222,220 xmit/s
Packets received : 3,336,711
Average packets/s : 222,447
Packets transmitted : 3,336,710
Average transmit/s : 222,447
I also noticed driver does a MMIO write (on rxq) for every RX-packet in
fec_enet_rx_queue() napi-poll loop. This also looks like a potential
performance stall.
The same as txq, the rxq will be inactive if the rx-BDR has no free BDs, so we'd
better do a MMIO write when we recycle a BD, so that the hardware can timely
attach the received pakcets on the rx-BDR.
In addition, I also tried to avoid using xdp_convert_buff_to_frame(), but the
performance of XDP_TX is still not improved. :(
I would not expect much performance improvement from this anyhow.
After these days of testing, I think it's best to keep the solution in V3, and then
make some optimizations on the V3 patch.
I agree.
I think you need to send a V5, and then I can ACK that.
Thanks for all this testing,
--Jesper