On Sun, Nov 3, 2024 at 12:21 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
>
> On Wed, 30 Oct 2024 18:23:26 -0600 Caleb Sander Mateos wrote:
> > In a heavy TCP workload, mlx5e_handle_rx_dim() consumes 3% of CPU time,
> > 94% of which is attributed to the first push instruction to copy
> > dim_sample on the stack for the call to net_dim():
>
> Change itself looks fine, so we can apply, but this seems surprising.
> Are you sure this is not just some measurement problem?
> Do you see 3% higher PPS with this change applied?

Agreed, this bottleneck surprised me too. But the CPU profiles clearly
point to this push instruction in mlx5e_handle_rx_dim() being very hot.

My best explanation is that the 2- and 4-byte stores followed
immediately by 8-byte loads from the same addresses cannot be pipelined
effectively. The loads must wait for the stores to complete before
reading back the values they wrote.

Ideally the compiler would recognize that the struct dim_sample local
variable is only used to pass to net_dim() and avoid duplicating it. I
guess passing large structs by value in C is not very common, so there
probably isn't as much effort put into optimizing it.

With the patches applied, the CPU time spent in mlx5e_handle_rx_dim()
(excluding children) drops from 3.14% to 0.08%. Unfortunately, there
are other bottlenecks in the system and 1% variation in the throughput
is typical, so the patches don't translate into a clear 3% increase in
throughput.

Best,
Caleb
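
For anyone who wants to look at the pattern outside the driver, here is
a minimal sketch of the two calling styles discussed above. Everything
in it is an assumption made for illustration: struct sample is a
hypothetical stand-in (not the kernel's struct dim_sample), and
make_sample(), sum_by_value(), and sum_by_pointer() are made-up helpers,
not net_dim() or any mlx5 code. On x86-64 with the System V ABI, a
24-byte struct like this one is passed by value in memory, so the caller
has to copy it onto the stack. If the fields were just written with 2-
and 4-byte stores, that copy may read them back with wider loads, which
is the situation described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a small statistics sample.
 * Mixed field widths: 8 + 4 + 4 + 2 (+2 padding) + 4 = 24 bytes
 * on a typical LP64 target. */
struct sample {
	uint64_t time;
	uint32_t pkts;
	uint32_t bytes;
	uint16_t events;
	uint32_t comps;
};

/* Fill the struct field by field, i.e. with narrow stores. */
static struct sample make_sample(void)
{
	struct sample s;

	s.time = 1000;
	s.pkts = 10;
	s.bytes = 20;
	s.events = 3;
	s.comps = 4;
	return s;
}

/* By value: the caller must materialize a copy of the whole struct
 * for the callee, typically on the stack for a struct this size. */
static uint64_t sum_by_value(struct sample s)
{
	return s.time + s.pkts + s.bytes + s.events + s.comps;
}

/* By pointer: only an address is passed, so no copy is needed. */
static uint64_t sum_by_pointer(const struct sample *s)
{
	return s->time + s->pkts + s->bytes + s->events + s->comps;
}

/* Both styles compute the same result; only the call sequence differs. */
static void check_equivalent(void)
{
	struct sample s = make_sample();

	assert(sum_by_value(s) == sum_by_pointer(&s));
}
```

Compiling this with optimizations and inlining disabled for the two sum
functions, then comparing the generated call sequences, should show
whether a given compiler emits the extra copy. Whether that copy stalls
depends on the CPU, so this is only a way to inspect the code
generation, not a benchmark.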