[...]

> You may be interested in siblings-aware CPU distribution I've made
> for mana ethernet driver in 91bfe210e196. This is also an example
> where using for_each_numa_hop_mask() over simple cpumask_local_spread()
> is justified.

That's clearly a topic I'd like to discuss, because the allocation strategy
may vary depending on the hardware and/or usage. I've been investigating a
case where the default mlx5 allocation isn't what I need.

1/ I noticed that setting smp_affinity in an RFS context didn't change the
IRQ allocation, and I was wondering if that is the expected behavior. It
prevents any later tuning that an application could require. It would be
super helpful to be able to influence the placement from the host, to avoid
hardcoded allocators that may not match a particular hardware configuration.

2/ I was also wondering whether we should have a kernel module option to
choose the allocation algorithm (I have a POC in that direction, see the
sketch below). The benefit would be to let the platform owner select the
allocation algorithm the sysadmin needs.

On single-package AMD EPYC servers, the NUMA topology is pretty handy for
mapping the L3 affinity, but it doesn't provide any particular hint about
the actual "distance" to the network device. You can have up to 12 NUMA
nodes on a single package, yet the actual distance to the NIC is almost
identical for all of them, since each core has to go through the IO die to
reach the PCI devices.

The current NUMA allocation logic seems to carry assumptions like "one NUMA
node per package", whereas the actual distance between nodes should be
considered. In my case, the NIC is reported on NUMA node 6 (of 8),
inherited from the PXM configuration. With the current "proximity" logic,
all cores of that NUMA domain are consumed before reaching the next one,
and so on. This leads to a very unbalanced configuration where a few NUMA
domains are fully allocated while others are left free. When SMT is
enabled, consuming all cores of a NUMA domain also means using
hyperthreads, which can be less optimal than using real cores from
adjacent nodes.

In a hypervisor-like use case, where multiple containers from various users
run on the same system, having RFS enabled helps each user handle its own
share of the traffic it generates. In such a configuration, it would be
better to let the allocator consume cores from each NUMA node of the same
package one by one to get a balanced configuration. That would also have
the advantage of not consuming hyperthreads until at least one IRQ per
physical core has been allocated.

That allocation logic could be worth sharing between drivers, to let
sysadmins get a balanced IRQ mapping on modern, multi-node-per-socket
architectures.

WDYT of having a selectable logic and adding this type of
"package-balanced" allocator?

Erwan,
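
PS: to make 2/ a bit more concrete, the knob I had in mind is nothing more
than a module parameter selecting the spreading policy. This is only an
illustration: the parameter name is made up, it is neither an existing mlx5
option nor my actual POC.

	#include <linux/moduleparam.h>

	/* Illustrative only: made-up knob to pick the IRQ spreading policy. */
	static char *irq_spread_policy = "default";
	module_param(irq_spread_policy, charp, 0444);
	MODULE_PARM_DESC(irq_spread_policy,
			 "IRQ affinity spreading policy (default | package_balanced)");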
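And here is a very rough, untested sketch of the "package-balanced" spread
itself, again with made-up names (this is neither the mlx5 code nor the
generic affinity code): round-robin over the NUMA nodes, one CPU per node
per round, handing out the primary SMT thread of each core first and
falling back to hyperthreads only once every physical core already has a
vector.

	#include <linux/cpumask.h>
	#include <linux/nodemask.h>
	#include <linux/topology.h>
	#include <linux/gfp.h>

	/*
	 * Made-up helper, not existing kernel code: fill @masks[0..nvec-1]
	 * (assumed zeroed by the caller) with one CPU each, taking one CPU
	 * per online NUMA node per round so the vectors spread evenly over
	 * the nodes.  Pass 0 only hands out the first thread of each core;
	 * pass 1 uses the remaining SMT siblings, so hyperthreads are left
	 * alone until every physical core already has a vector.
	 */
	static int package_balanced_spread(unsigned int nvec, struct cpumask *masks)
	{
		cpumask_var_t used;
		unsigned int cpu, vec = 0;
		int pass, node;

		if (!zalloc_cpumask_var(&used, GFP_KERNEL))
			return -ENOMEM;

		for (pass = 0; pass < 2 && vec < nvec; pass++) {
			bool progress = true;

			while (progress && vec < nvec) {
				progress = false;

				for_each_online_node(node) {
					for_each_cpu(cpu, cpumask_of_node(node)) {
						bool primary = cpu ==
							cpumask_first(topology_sibling_cpumask(cpu));

						if (!cpu_online(cpu))
							continue;
						if (cpumask_test_cpu(cpu, used))
							continue;
						/* pass 0: primaries only; pass 1: siblings only */
						if ((pass == 0) != primary)
							continue;

						cpumask_set_cpu(cpu, used);
						cpumask_set_cpu(cpu, &masks[vec++]);
						progress = true;
						break;	/* one CPU per node per round */
					}
					if (vec >= nvec)
						break;
				}
			}
		}

		free_cpumask_var(used);
		return 0;
	}

With the node-6 example above, the first vectors would land on different
nodes instead of all piling up on node 6, and hyperthreads would only be
touched once every physical core has one IRQ. A fancier version could use
for_each_numa_hop_mask() as in the mana commit to order the nodes by
distance, but on a single-package EPYC all nodes are basically equidistant
from the NIC anyway.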