> Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots
> with parameters affecting NUMA topology
>
> From: longli@xxxxxxxxxxxxxxxxx <longli@xxxxxxxxxxxxxxxxx>
> Sent: Thursday, January 6, 2022 3:20 PM
>
> >
> > When the kernel boots with parameters restricting the number of cpus
> > or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI driver should only
> > set the NUMA node to a value that is valid in the currently running kernel.
> >
> > Signed-off-by: Long Li <longli@xxxxxxxxxxxxx>
> > ---
> >  drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index fc1a29acadbb..8686343eff4c 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> >  		if (!hv_dev)
> >  			continue;
> >
> > -		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > -			set_dev_node(&dev->dev, hv_dev->desc.virtual_numa_node);
> > +		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > +			int cpu;
> > +			bool found_node = false;
> > +
> > +			for_each_possible_cpu(cpu)
> > +				if (cpu_to_node(cpu) ==
> > +				    hv_dev->desc.virtual_numa_node) {
> > +					found_node = true;
> > +					break;
> > +				}
> > +
> > +			if (found_node)
> > +				set_dev_node(&dev->dev,
> > +					     hv_dev->desc.virtual_numa_node);
> > +		}
>
> I'm wondering about this approach vs. just comparing against nr_node_ids.

I was trying to fix this by comparing with nr_node_ids. That worked for
numa=off, but it didn't work with maxcpus=X. maxcpus=X is commonly used
in kdump kernels. In that configuration, the memory system is initialized
in such a way that only the NUMA nodes covering the CPUs allowed by
maxcpus are set up and can be used by drivers.

> Comparing against nr_node_ids would handle the case of numa=off on the
> kernel boot line, or a kernel built with CONFIG_NUMA=n, or the use of
> numa=fake. Your approach is also affected by which CPUs are online, since
> cpu_to_node() references percpu data. It would seem to produce more
> variable results, since CPUs can go online and offline while the VM is
> running. If a network VF device was removed and re-added, the results of
> your algorithm could be different for the re-add, depending on which CPUs
> were online at the time.
>
> My impression (which may be incorrect) is that the device numa_node is
> primarily to allow the driver to allocate memory from the closest NUMA
> node, and such memory allocations don't really need to be affected by
> which CPUs are online.

Yes, this is the reason I'm using for_each_possible_cpu(). Even if some
CPUs are not online, the memory system is set up in a way that allows the
driver to allocate memory on that NUMA node. The algorithm guarantees that
the NUMA node value is valid when calling set_dev_node().

> Thoughts?
>
> >
> >  		put_pcichild(hv_dev);
> >  	}
> > --
> > 2.25.1
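
On a related note: if the variability across CPU online/offline events is
the deciding concern, a rough alternative (just a sketch, untested, and not
what this patch does) would be to map the host-provided value to an online
node with numa_map_to_online_node() from include/linux/numa.h before
assigning it:

	if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
		/*
		 * Clamp the host-provided node to one that is online in
		 * the running kernel, so set_dev_node() never sees a node
		 * the memory subsystem has not set up.
		 */
		int node = numa_map_to_online_node(
				hv_dev->desc.virtual_numa_node);

		if (node != NUMA_NO_NODE)
			set_dev_node(&dev->dev, node);
	}

That would keep the result stable regardless of which CPUs happen to be
online, at the cost of silently substituting a nearby node instead of
rejecting an invalid one.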