Re: [PATCH] net/mlx5: Use cpumask_local_spread() instead of custom code

Jakub Kicinski <kuba@xxxxxxxxxx> · Mon, 19 Aug 2024 08:34:26 -0700

On Mon, 19 Aug 2024 12:15:10 +0200 Erwan Velu wrote:
> 2/ I was also wondering if we shouldn't have a kernel module option to
> choose the allocation algorithm (I have a POC in that direction).
> The benefit could be allowing the platform owner to select the
> allocation algorithm that sys-admin needs.
> On single-package AMD EPYC servers, the numa topology is pretty handy
> for mapping the L3 affinity but it doesn't provide any particular hint
> about the actual "distance" to the network device.
> You can have up to 12 NUMA nodes on a single package but the actual
> distance to the nic is almost identical as each core needs to use the
> IOdie to reach the PCI devices.
> The NUMA allocation logic carries assumptions like "1 NUMA node per
> package", whereas the actual distance between nodes should be
> considered in the allocation logic.

I think user space has more information on what the appropriate
placement is than the kernel. We can have a reasonable default,
and maybe try not to stupidly reset the settings when config
changes (I don't think mlx5 does that but other drivers do);
but having a way to select the algorithm would only work if there
were a well-understood and finite set of algorithms.

IMHO we should try to sell this task to systemd-networkd or some other
user space daemon. We now have netlink access to NAPI information,
including IRQ<>NAPI<>queue mapping. It's possible to implement a
completely driver-agnostic IRQ mapping support from user space
(without the need to grep IRQ names like we used to).
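
To make the idea concrete, here is a minimal sketch of what the
placement half of such a user space mapper could look like. It is an
illustration under stated assumptions, not an existing tool: the
function names are hypothetical, the input records are assumed to be
shaped like entries of a napi-get dump from the netdev netlink family
(an "id", an "ifindex" and an optional "irq"), and round-robin is just
a placeholder policy. Fetching the records over netlink is left out.
The write goes through /proc/irq/<n>/smp_affinity_list and needs root.

```python
import os
from typing import Dict, Iterable, List


def plan_irq_affinity(napis: Iterable[dict], cpus: List[int]) -> Dict[int, int]:
    """Assign each NAPI IRQ to one CPU, round-robin over `cpus`.

    `napis` is assumed to hold dicts like {"id": 10, "ifindex": 2, "irq": 40}.
    Entries without an "irq" key (NAPI instances not backed by an IRQ)
    are skipped.
    """
    if not cpus:
        raise ValueError("need at least one CPU to place IRQs on")
    # Sort and de-duplicate so the plan is stable across runs.
    irqs = sorted({n["irq"] for n in napis if "irq" in n})
    return {irq: cpus[i % len(cpus)] for i, irq in enumerate(irqs)}


def apply_irq_affinity(plan: Dict[int, int], proc_irq: str = "/proc/irq") -> None:
    """Write the plan to <proc_irq>/<irq>/smp_affinity_list.

    `proc_irq` is a parameter only so the function can be pointed at a
    scratch directory for a dry run.
    """
    for irq, cpu in plan.items():
        path = os.path.join(proc_irq, str(irq), "smp_affinity_list")
        with open(path, "w") as f:
            f.write(f"{cpu}\n")
```

The policy lives entirely in plan_irq_affinity(), so a daemon could
swap round-robin for something topology-aware (L3 sharing, distance to
the device) without touching any driver.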