Le Mon, 15 Nov 2021 14:31:06 +0000,
Vladimir Oltean <vladimir.oltean@xxxxxxx> wrote:

> On Mon, Nov 15, 2021 at 03:06:20PM +0100, Clément Léger wrote:
> > Le Mon, 15 Nov 2021 06:08:00 -0800,
> > Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
> >
> > > On Mon, 15 Nov 2021 11:13:44 +0100 Clément Léger wrote:
> > > > Test on standard packets with UDP (iperf3 -t 100 -l 1460 -u -b 0 -c *)
> > > > - With pre-computed header: UDP TX: 33Mbit/s
> > > > - Without: UDP TX: 31Mbit/s
> > > > -> 6.5% improvement
> > > >
> > > > Test on small packets with UDP (iperf3 -t 100 -l 700 -u -b 0 -c *)
> > > > - With pre-computed header: UDP TX: 15.8Mbit/s
> > > > - Without: UDP TX: 16.4Mbit/s
> > > > -> 4.3% improvement
> > >
> > > Something's wrong with these numbers or I'm missing context.
> > > You say improvement in both cases yet in the latter case the
> > > new number is lower?
> >
> > You are right Jakub, I swapped the last two results:
> >
> > Test on small packets with UDP (iperf3 -t 100 -l 700 -u -b 0 -c *)
> > - With pre-computed header: UDP TX: 16.4Mbit/s
> > - Without: UDP TX: 15.8Mbit/s
> > -> 4.3% improvement
>
> Even in reverse, something still seems wrong with the numbers.
> My DSPI controller can transfer at a higher data rate than that.
> Where is the rest of the time spent? Computing checksums?

While adding FDMA support, I was surprised by the low performance I
encountered, so I spent some time trying to understand where the time
was spent.

First, I ran an iperf in loopback (using lo), which yielded the
following results (of course, RX and TX run on the same CPU in this
case):

TCP (iperf3 -c localhost):
 - RX/TX: 84.0Mbit/s

UDP (iperf3 -u -b 0 -c localhost):
 - RX/TX: 65.0Mbit/s

So even in loopback mode, the CPU is already really slow and can only
sustain a really small "throughput".
I then checked the performance using CPU-based injection/extraction
and obtained the following results:

TCP (iperf3 -c):
 - TX: 11.8Mbit/s
 - RX: 21.6Mbit/s

UDP (iperf3 -u -b 0 -c):
 - TX: 13.4Mbit/s
 - RX: not even measurable, the CPU never managed to extract a single
   packet

I then tried to find out where the time was spent using perf (keeping
only the relevant functions that consume most of the time). The
following results were recorded when using iperf3 with CPU-based
injection/extraction.

In TCP TX, a lot of time is spent copying data from user space:

 41.71%  iperf3  [kernel.kallsyms]  [k] __raw_copy_to_user
  6.65%  iperf3  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  3.23%  iperf3  [kernel.kallsyms]  [k] do_ade
  2.10%  iperf3  [kernel.kallsyms]  [k] __ocelot_write_ix
  2.10%  iperf3  [kernel.kallsyms]  [k] handle_adel_int
  ...

In TCP RX, the numbers are even worse for the time spent in
__raw_copy_to_user:

 62.95%  iperf3  [kernel.kallsyms]  [k] __raw_copy_to_user
  1.97%  iperf3  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  1.15%  iperf3  [kernel.kallsyms]  [k] __copy_page_start
  1.07%  iperf3  [kernel.kallsyms]  [k] __skb_datagram_iter
  ...

In UDP TX, some time is spent handling locking and unaligned accesses,
as well as pushing packets. The unaligned accesses are due to the
driver accessing the packet bytes directly as words, which traps when
the data is misaligned:

 17.97%  iperf3  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 11.94%  iperf3  [kernel.kallsyms]  [k] do_ade
  9.07%  iperf3  [kernel.kallsyms]  [k] __ocelot_write_ix
  7.74%  iperf3  [kernel.kallsyms]  [k] handle_adel_int
  5.78%  iperf3  [kernel.kallsyms]  [k] copy_from_kernel_nofault
  4.71%  iperf3  [kernel.kallsyms]  [k] __compute_return_epc_for_insn
  2.51%  iperf3  [kernel.kallsyms]  [k] regmap_write
  2.31%  iperf3  [kernel.kallsyms]  [k] __compute_return_epc
  ...

In UDP RX (iperf3 with -b 5M to ensure packets are received), time is
spent in floating-point emulation and various other functions:
  7.26%  iperf3  [kernel.kallsyms]  [k] cop1Emulate
  2.84%  iperf3  [kernel.kallsyms]  [k] do_select
  2.08%  iperf3  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  2.06%  iperf3  [kernel.kallsyms]  [k] fpu_emulator_cop1Handler
  2.01%  iperf3  [kernel.kallsyms]  [k] tcp_poll
  2.00%  iperf3  [kernel.kallsyms]  [k] __raw_copy_to_user

When using the FDMA, the results are the following.

In TCP TX, the copy from user space is still present, and checksumming
takes quite some time:

 31.31%  iperf3  [kernel.kallsyms]  [k] __raw_copy_to_user
 10.48%  iperf3  [kernel.kallsyms]  [k] __csum_partial_copy_to_user
  3.73%  iperf3  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  2.08%  iperf3  [kernel.kallsyms]  [k] tcp_ack
  1.68%  iperf3  [kernel.kallsyms]  [k] ocelot_fdma_napi_poll
  1.63%  iperf3  [kernel.kallsyms]  [k] tcp_write_xmit
  1.05%  iperf3  [kernel.kallsyms]  [k] finish_task_switch

In TCP RX, the majority of the time is still taken by
__raw_copy_to_user:

 63.95%  iperf3  [kernel.kallsyms]  [k] __raw_copy_to_user
  1.29%  iperf3  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  1.23%  iperf3  [kernel.kallsyms]  [k] tcp_recvmsg_locked
  1.23%  iperf3  [kernel.kallsyms]  [k] __skb_datagram_iter
  1.07%  iperf3  [kernel.kallsyms]  [k] vfs_read

In UDP TX, time is spent in softirq entry and in checksumming:

  9.01%  iperf3  [kernel.kallsyms]  [k] __softirqentry_text_start
  7.07%  iperf3  [kernel.kallsyms]  [k] __csum_partial_copy_to_user
  2.28%  iperf3  [kernel.kallsyms]  [k] __ip_append_data.isra.0
  2.10%  iperf3  [kernel.kallsyms]  [k] __dev_queue_xmit
  2.08%  iperf3  [kernel.kallsyms]  [k] siphash_3u32
  2.06%  iperf3  [kernel.kallsyms]  [k] udp_sendmsg

And in UDP RX, again, time is spent in floating-point emulation and
checksumming:
 10.33%  iperf3  [kernel.kallsyms]  [k] cop1Emulate
  7.62%  iperf3  [kernel.kallsyms]  [k] csum_partial
  3.32%  iperf3  [kernel.kallsyms]  [k] do_select
  2.69%  iperf3  [kernel.kallsyms]  [k] ieee754dp_sub
  2.68%  iperf3  [kernel.kallsyms]  [k] fpu_emulator_cop1Handler
  2.56%  iperf3  [kernel.kallsyms]  [k] ieee754dp_add
  2.33%  iperf3  [kernel.kallsyms]  [k] ieee754dp_div

After all these measurements, the CPU appears to be the bottleneck: it
simply spends a lot of time in various functions. I did not go further
with perf events since there was no real reason to dig deeper in that
direction.

-- 
Clément Léger, Embedded Linux and Kernel engineer at Bootlin
https://bootlin.com