[...]

> You may be interested in siblings-aware CPU distribution I've made
> for mana ethernet driver in 91bfe210e196. This is also an example
> where using for_each_numa_hop_mask() over simple cpumask_local_spread()
> is justified.

That's clearly a topic I'd like to discuss, because the allocation strategy
may vary depending on the hardware and/or usage. I've been investigating a
case where the default mlx5 allocation isn't what I need.

1/ I noticed that setting smp_affinity in an RFS context didn't change the
IRQ allocation, and I was wondering if that is the expected behavior. It
prevents any later tuning that an application could require. It would be
super helpful to be able to influence the placement from the host, to avoid
hardcoded allocators that may not match a particular hardware configuration.

2/ I was also wondering whether we should have a kernel module option to
choose the allocation algorithm (I have a POC in that direction, see the
sketch below). The benefit would be to let the platform owner select the
allocation algorithm the sysadmin needs.

On single-package AMD EPYC servers, the NUMA topology is pretty handy for
mapping the L3 affinity, but it doesn't provide any particular hint about
the actual "distance" to the network device. You can have up to 12 NUMA
nodes on a single package, yet the actual distance to the NIC is almost
identical for all of them, since each core has to go through the IO die to
reach the PCI devices.

The current NUMA allocation logic seems to carry assumptions like "one NUMA
node per package", whereas the actual distance between nodes should be
considered. In my case, the NIC is reported on NUMA node 6 (of 8),
inherited from the PXM configuration. With the current "proximity" logic,
all cores of that NUMA domain are consumed before reaching the next one,
and so on. This leads to a very unbalanced configuration where a few NUMA
domains are fully allocated while others are left free. When SMT is
enabled, consuming all cores of a NUMA domain also means using
hyperthreads, which can be less optimal than using real cores from
adjacent nodes.

In a hypervisor-like use case, where multiple containers from various users
run on the same system, having RFS enabled helps each user handle its own
share of the traffic it generates. In such a configuration, it would be
better to let the allocator consume cores from each NUMA node of the same
package one by one to get a balanced configuration. That would also have
the advantage of not consuming hyperthreads until at least one IRQ per
physical core has been allocated.

That allocation logic could be worth sharing between drivers, to let
sysadmins get a balanced IRQ mapping on modern, multi-node-per-socket
architectures.

WDYT of having a selectable logic and adding this type of
"package-balanced" allocator?

Erwan,
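
PS: to make 2/ a bit more concrete, the knob I had in mind is nothing more
than a module parameter selecting the spreading policy. This is only an
illustration: the parameter name is made up, it is neither an existing mlx5
option nor my actual POC.

	#include <linux/moduleparam.h>

	/* Illustrative only: made-up knob to pick the IRQ spreading policy. */
	static char *irq_spread_policy = "default";
	module_param(irq_spread_policy, charp, 0444);
	MODULE_PARM_DESC(irq_spread_policy,
			 "IRQ affinity spreading policy (default | package_balanced)");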
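And here is a very rough, untested sketch of the "package-balanced" spread
itself, again with made-up names (this is neither the mlx5 code nor the
generic affinity code): round-robin over the NUMA nodes, one CPU per node
per round, handing out the primary SMT thread of each core first and
falling back to hyperthreads only once every physical core already has a
vector.

	#include <linux/cpumask.h>
	#include <linux/nodemask.h>
	#include <linux/topology.h>
	#include <linux/gfp.h>

	/*
	 * Made-up helper, not existing kernel code: fill @masks[0..nvec-1]
	 * (assumed zeroed by the caller) with one CPU each, taking one CPU
	 * per online NUMA node per round so the vectors spread evenly over
	 * the nodes.  Pass 0 only hands out the first thread of each core;
	 * pass 1 uses the remaining SMT siblings, so hyperthreads are left
	 * alone until every physical core already has a vector.
	 */
	static int package_balanced_spread(unsigned int nvec, struct cpumask *masks)
	{
		cpumask_var_t used;
		unsigned int cpu, vec = 0;
		int pass, node;

		if (!zalloc_cpumask_var(&used, GFP_KERNEL))
			return -ENOMEM;

		for (pass = 0; pass < 2 && vec < nvec; pass++) {
			bool progress = true;

			while (progress && vec < nvec) {
				progress = false;

				for_each_online_node(node) {
					for_each_cpu(cpu, cpumask_of_node(node)) {
						bool primary = cpu ==
							cpumask_first(topology_sibling_cpumask(cpu));

						if (!cpu_online(cpu))
							continue;
						if (cpumask_test_cpu(cpu, used))
							continue;
						/* pass 0: primaries only; pass 1: siblings only */
						if ((pass == 0) != primary)
							continue;

						cpumask_set_cpu(cpu, used);
						cpumask_set_cpu(cpu, &masks[vec++]);
						progress = true;
						break;	/* one CPU per node per round */
					}
					if (vec >= nvec)
						break;
				}
			}
		}

		free_cpumask_var(used);
		return 0;
	}

With the node-6 example above, the first vectors would land on different
nodes instead of all piling up on node 6, and hyperthreads would only be
touched once every physical core has one IRQ. A fancier version could use
for_each_numa_hop_mask() as in the mana commit to order the nodes by
distance, but on a single-package EPYC all nodes are basically equidistant
from the NIC anyway.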