On 12 May 2022, at 3:15, Jonathan Cameron wrote: > On Wed, 11 May 2022 19:44:14 +0200 > "Rafael J. Wysocki" <rafael@xxxxxxxxxx> wrote: > >> On Wed, May 11, 2022 at 7:42 PM Jonathan Lemon <jonathan.lemon@xxxxxxxxx> wrote: >>> >>> On 11 May 2022, at 10:33, Rafael J. Wysocki wrote: >>> >>>> On Wed, May 11, 2022 at 7:24 PM Jonathan Lemon <jonathan.lemon@xxxxxxxxx> wrote: >>>>> >>>>> This reverts commit a62d07e0006a3a3ce77041ca07f3c488ec880790. >>>>> >>>>> The change calls pxm_to_node(), which ends up returning -1 >>>>> (NUMA_NO_NODE) on some systems for the pci bus, as opposed >>>>> to the prior call to acpi_map_pxm_to_node(), which returns 0. >>>>> >>>>> The default numa node is then inherited by all pci devices, and is >>>>> visible in /sys/bus/pci/devices/*/numa_node >>>>> >>>>> The prior behavior shows: >>>>> # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c >>>>> 122 0 >>>>> >>>>> While the new behavior has: >>>>> # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c >>>>> 1 0 > > Curious, which device is turning up in node 0? Oddly enough, the NVME drive: 01:00.0 Non-Volatile memory controller: SK hynix PC401 NVMe Solid State Drive 256GB (prog-if 02 [NVM Express]) Subsystem: SK hynix PC401 NVMe Solid State Drive 256GB NUMA node: 0 These are single-socket Skylake DE platforms. >>>>> >>>>> While arguably NUMA_NO_NODE is correct on single-socket systems which >>>>> have only one numa domain, this breaks scripts that attempt to read the >>>>> NIC numa_node and pass that to numactl in order to pin memory allocation >>>>> when running applications (like iperf). E.g.: >>>>> >>>>> # numactl -p -1 iperf3 >>>>> libnuma: Warning: node argument -1 is out of range >>>>> <-1> is invalid >>>>> >>>>> Reverting this change restores the prior behavior. >>>> >>>> Well, that's not a recent commit and it fixed a real and serious issue. >>>> >>>> Isn't there a way to fix this other than reverting it? >>> >>> The userspace behavior changed - is there another way to fix things >>> so that a valid numa_node is returned? >> >> Well, that's my question. This also could be a BIOS issue that wasn’t noticed until the platforms were updated to a newer kernel. — Jonathan > As Rafael noted, we don't want to change the internal kernel representation because > previous kernel behavior resulting in several paths where you could > get NULL pointer de-references, but maybe we could special case > it at the userspace boundary. > > e.g. override dev_to_node() return value here > https://elixir.bootlin.com/linux/v5.18-rc6/source/drivers/pci/pci-sysfs.c#L358 > > What's problematic is we missed this being being an issue until now and hence > have shipping kernels with both behaviors. > > +CC Bjorn and linux-pci > > Jonathan