Paolo Abeni wrote:
> On Tue, 2024-04-16 at 11:21 +0200, Paolo Abeni wrote:
>> On Fri, 2024-04-12 at 17:55 +0200, Richard Gobert wrote:
>>> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
>>> iph->id, ...) against all packets in a loop. These flush checks are
>>> currently used in all TCP flows and in some UDP flows in GRO.
>>>
>>> These checks need to be done only once and only against the found p skb,
>>> since they only affect flush and not same_flow.
>>>
>>> Leveraging the previous commit in the series, in which correct network
>>> header offsets are saved for both outer and inner network headers,
>>> these checks can now be done only once, in tcp_gro_receive and
>>> udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used
>>> at all. In addition, flush_id checks are more declarative and contained
>>> in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb.
>>>
>>> This results in less parsing code for UDP flows and non-loop flush
>>> tests for TCP flows.
>>>
>>> To make sure results are not within noise range, I've made netfilter
>>> drop all TCP packets and measured CPU performance in GRO (in this case
>>> GRO is responsible for about 50% of the CPU utilization).
>>>
>>> L3 flush/flush_id checks are not relevant to UDP connections where
>>> skb_gro_receive_list is called. The only code change relevant to this
>>> flow is in inet_gro_receive. The rest of the code parsing this flow
>>> stays the same.
>>>
>>> All concurrent connections tested are with the same ip srcaddr and
>>> dstaddr.
>>>
>>> perf top while replaying 64 concurrent IP/UDP connections (UDP fwd flow):
>>> net-next:
>>>   3.03% [kernel] [k] inet_gro_receive
>>>
>>> patch applied:
>>>   2.78% [kernel] [k] inet_gro_receive
>>
>> Why are there no figures for
>> udp_gro_receive_segment()/gro_network_flush() here?
>>
>> Also, you should be able to observe a very high amount of CPU usage by
>> GRO even with TCP on very high speed links, by keeping the BH/GRO on
>> one CPU and the user-space/data copy on a different one (or by using
>> rx zero copy).
>
> To be more explicit: I think at least the above figures are required,
> and I still fear the real gain in that case would range from zero to
> negative.

I wrote about it briefly in the commit message; sorry if I wasn't clear
enough. gro_network_flush is compiled inline into both
udp_gro_receive_segment and tcp_gro_receive, and udp_gro_receive_segment
is compiled inline into udp_gro_receive.

The UDP numbers I posted are no longer relevant after Willem and
Alexander's thread, from which we understood that flush and flush_id
should be calculated for all UDP flows. I can post new numbers for the
UDP fwd path after implementing the correct change.

As for TCP, the numbers I posted stay the same. Note that there is an
increase in CPU utilization in tcp_gro_receive because of the inline
compilation of gro_network_flush. The numbers make sense and show a
performance enhancement in the case I showed, when both inet_gro_receive
and tcp_gro_receive are accounted for.

> If you can't do the TCP part of the testing, please provide at least
> the figures for a single UDP flow; that should give more indication WRT
> the result we can expect with TCP.
>
> Note that GRO is used mainly by TCP, and TCP packets with different
> src/dst ports will land in different GRO hash buckets, having
> different RX hashes.
>
> That will happen even for UDP, since at least some (most?) NICs
> include the UDP ports in the RX hash.
>
> Thanks,
>
> Paolo
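For reference, the idea behind the consolidated check can be sketched in
simplified, user-space C. This is illustrative only, not the kernel
code: `struct iph`, `MY_IP_DF`, `l3_flush_check` and `seg_count` are
made-up stand-ins for `struct iphdr`, `IP_DF`, the `inet_gro_flush`-style
helper and `NAPI_GRO_CB(p)->count`, and fields are kept in host byte
order for clarity. The point is that the L3 fields are compared once
against the matched flow head p, not against every held packet in a loop:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the IPv4 header fields the
 * GRO flush checks look at -- not the kernel's struct iphdr. */
struct iph {
	uint8_t  ttl;
	uint8_t  tos;
	uint16_t frag_off;	/* host byte order here for simplicity */
	uint16_t id;		/* host byte order here for simplicity */
};

#define MY_IP_DF 0x4000		/* don't-fragment flag bit (stand-in for IP_DF) */

/* Sketch of the single consolidated L3 flush check: compare the
 * incoming header (skb) against the matched flow head (p) only.
 * seg_count plays the role of NAPI_GRO_CB(p)->count.
 * Returns nonzero if the aggregated flow must be flushed. */
static int l3_flush_check(const struct iph *p, const struct iph *skb,
			  uint16_t seg_count)
{
	/* ttl/tos and the DF flag must match across the flow */
	if (p->ttl != skb->ttl || p->tos != skb->tos)
		return 1;
	if ((p->frag_off ^ skb->frag_off) & MY_IP_DF)
		return 1;
	/* For non-atomic datagrams (DF clear), the IP ID is expected
	 * to advance by one per aggregated segment */
	if (!(skb->frag_off & MY_IP_DF) &&
	    (uint16_t)(skb->id - p->id) != seg_count)
		return 1;
	return 0;
}
```

Since this runs once per incoming skb against p rather than inside the
bucket loop, the per-packet flush work no longer scales with the number
of held packets, which is where the inet_gro_receive savings come from.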