On Wed, 11 May 2022 19:44:14 +0200
"Rafael J. Wysocki" <rafael@xxxxxxxxxx> wrote:

> On Wed, May 11, 2022 at 7:42 PM Jonathan Lemon <jonathan.lemon@xxxxxxxxx> wrote:
> >
> > On 11 May 2022, at 10:33, Rafael J. Wysocki wrote:
> >
> > > On Wed, May 11, 2022 at 7:24 PM Jonathan Lemon <jonathan.lemon@xxxxxxxxx> wrote:
> > >>
> > >> This reverts commit a62d07e0006a3a3ce77041ca07f3c488ec880790.
> > >>
> > >> The change calls pxm_to_node(), which ends up returning -1
> > >> (NUMA_NO_NODE) on some systems for the pci bus, as opposed
> > >> to the prior call to acpi_map_pxm_to_node(), which returns 0.
> > >>
> > >> The default numa node is then inherited by all pci devices, and is
> > >> visible in /sys/bus/pci/devices/*/numa_node
> > >>
> > >> The prior behavior shows:
> > >>    # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c
> > >>        122 0
> > >>
> > >> While the new behavior has:
> > >>    # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c
> > >>          1 0

Curious, which device is turning up in node 0?

> > >>        121 -1
> > >>
> > >> While arguably NUMA_NO_NODE is correct on single-socket systems which
> > >> have only one numa domain, this breaks scripts that attempt to read the
> > >> NIC numa_node and pass that to numactl in order to pin memory allocation
> > >> when running applications (like iperf).  E.g.:
> > >>
> > >>    # numactl -p -1 iperf3
> > >>    libnuma: Warning: node argument -1 is out of range
> > >>    <-1> is invalid
> > >>
> > >> Reverting this change restores the prior behavior.
> > >
> > > Well, that's not a recent commit and it fixed a real and serious issue.
> > >
> > > Isn't there a way to fix this other than reverting it?
> >
> > The userspace behavior changed - is there another way to fix things
> > so that a valid numa_node is returned?
>
> Well, that's my question.

As Rafael noted, we don't want to change the internal kernel
representation, because the previous kernel behavior resulted in several
paths where you could get NULL pointer dereferences. But maybe we could
special-case it at the userspace boundary, e.g. override the
dev_to_node() return value here:

https://elixir.bootlin.com/linux/v5.18-rc6/source/drivers/pci/pci-sysfs.c#L358

What's problematic is that we missed this being an issue until now, and
hence have shipping kernels with both behaviors.

+CC Bjorn and linux-pci

Jonathan
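
P.S. To make the "special case it at the userspace boundary" idea a bit
more concrete, here is a rough sketch that compiles in userspace. It is
not the actual pci-sysfs.c code: user_visible_node() and
numa_node_show_sketch() are hypothetical names, and falling back to
node 0 is an assumption based on the value the old
acpi_map_pxm_to_node() path reported in the quoted output.

```c
#include <assert.h>
#include <stdio.h>

#define NUMA_NO_NODE (-1)	/* same value the kernel uses */

/*
 * Hypothetical helper: pick the node number to report to userspace.
 * The internal value is left alone; only the reported value is
 * remapped. Node 0 as the fallback is an assumption.
 */
static int user_visible_node(int node)
{
	return node == NUMA_NO_NODE ? 0 : node;
}

/*
 * Stand-in for a sysfs show callback: formats the reported node as
 * "N\n" into buf and returns the number of characters written.
 */
static int numa_node_show_sketch(int node, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%d\n", user_visible_node(node));
}
```

With something along these lines the in-kernel value would stay
NUMA_NO_NODE, and only the string handed to userspace would change, so
scripts feeding numa_node to numactl would see 0 rather than -1.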