From: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Date: Tue, 7 Jan 2025 18:17:06 +0100

> Awesome work! - some questions below
>
> On 07/01/2025 16.29, Alexander Lobakin wrote:
>> Several months ago, I had been looking through my old XDP hints tree[0]
>> to check whether some patches not directly related to hints could be
>> sent standalone. Roughly at the same time, Daniel appeared and asked[1]
>> about GRO for cpumap from that tree.
>>
>> Currently, cpumap uses its own kthread which processes
>> cpumap-redirected frames in batches of 8, without any weighting (but
>> with rescheduling points). The resulting skbs get passed to the stack
>> via netif_receive_skb_list(), which means no GRO happens.
>> Even though we can't currently pass checksum status from the drivers,
>> in many cases GRO performs better than listified Rx without
>> aggregation, as confirmed by tests.
>>
>> In order to enable GRO in cpumap, we need to do the following:
>>
>> * patches 1-2: decouple the GRO struct from the NAPI struct and allow
>>   using it outside of a NAPI entity within the kernel core code;
>> * patch 3: switch cpumap from netif_receive_skb_list() to
>>   gro_receive_skb().
>>
>> Additional improvements:
>>
>> * patch 4: optimize XDP_PASS in cpumap by using arrays instead of
>>   linked lists;
>> * patches 5-6: introduce and use a function to get skbs from the NAPI
>>   percpu caches in bulk, not one at a time;
>> * patches 7-8: use that function in veth as well and remove the one
>>   that is now superseded by it.
>>
>> My trafficgen UDP GRO tests, small frame sizes:
>>
>
> How does your trafficgen UDP test manage to get UDP GRO working?
> (Perhaps you can share test?)

I usually test as follows: xdp-trafficgen from xdp-tools on the sender,
then, on the receiver:

ethtool -K <iface> rx-udp-gro-forwarding on

There's no socket on the receiver, but this option enables GRO not only
when forwarding, but also when the packet is LOCAL_IN and there's just
no socket. The UDP core then drops the frame when doing the sk lookup,
as there's no socket. IOW, I have the following:

* GRO gets performed;
* stack overhead is there, up to the UDP lookup;
* the final frame is dropped, so no userspace copy overhead.

> What is the "small frame" size being used?

xdp-trafficgen currently hardcodes the frame size to 64 bytes. I was
planning to add an option to configure the frame size and send it
upstream, but unfortunately haven't finished it yet.
I realize that the boosts won't be as big on larger frames, since the
CPU will have to calculate checksums for larger buffers. OTOH, TCP
benchmarks usually send MTU-sized buffers (+ TSO), yet the performance
is still better.

> Is the UDP benchmark avoiding (re)calculating the RX checksum?
> (via setting UDP csum to zero)

Oh, I completely forgot about this one. I can imagine even bigger
boosts then, as the CPU checksumming will disappear.

>
>>             GRO off    GRO on
>> baseline    2.7        N/A    Mpps
>> patch 3     2.3        4      Mpps
>> patch 8     2.4        4.7    Mpps
>>
>> 1...3 diff  -17        +48    %
>> 1...8 diff  -11        +74    %
>>
>> Daniel reported from +14%[2] to +18%[3] of throughput in neper's TCP
>> RR tests. On my system, however, the same test gave me up to +100%.
>>
>
> I can imagine that the TCP throughput tests will yield a huge
> performance boost.
>
>> Note that there's a series from Lorenzo[4] which achieves the same,
>> but in a different way. During the discussions, the approach using a
>> standalone GRO instance was preferred over the threaded NAPI.
>>
>
> It looks like you are keeping the "remote" CPUMAP kthread process
> design intact in this series, right?

Right, the kthread logic remains the same as before.
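The only thing that changes is what the kthread does with the skbs it
builds: instead of queueing them on a list for netif_receive_skb_list(),
it feeds them into a standalone GRO instance. A minimal sketch of the
idea below; gro_receive_skb() is named in the cover letter, while
struct gro_node, gro_flush() and gro_normal_list() are my shorthand for
the helpers patches 1-2 carve out of the NAPI code, so see the patches
for the exact names and signatures:

/*
 * Illustrative sketch only, not the actual kernel/bpf/cpumap.c code:
 * assumes the standalone-GRO helpers decoupled from &napi_struct.
 */
#include <linux/skbuff.h>
#include <net/gro.h>

static void cpu_map_gro(struct gro_node *gro, struct sk_buff **skbs,
                        u32 count, bool ring_empty)
{
        u32 i;

        /* Feed each skb built from a cpumap-redirected frame into GRO
         * instead of batching it for netif_receive_skb_list().
         */
        for (i = 0; i < count; i++)
                gro_receive_skb(gro, skbs[i]);

        /* Do a full flush only when the ptr_ring is empty; otherwise,
         * flush just the old flows, same as NAPI does at the end of a
         * poll cycle.
         */
        gro_flush(gro, !ring_empty);

        /* Push the skbs queued on the GRO rx_list up the stack. */
        gro_normal_list(gro);
}

A gro_init()/gro_cleanup() pair would then naturally match the
kthread's setup and teardown.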
> I think this design works for our use-case: we want to give the
> "remote" CPU-thread a higher scheduling priority. It doesn't matter
> whether this is a kthread or a threaded-NAPI thread, as long as we can
> see it as a PID from userspace (by which we adjust the sched priority).
>
> Great to see this work progressing again :-)))
> --Jesper

Thanks,
Olek