> Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots
> with parameters affecting NUMA topology
>
> From: Long Li <longli@xxxxxxxxxxxxx> Sent: Tuesday, January 18, 2022 2:44 PM
> >
> > > From: Long Li <longli@xxxxxxxxxxxxx> Sent: Wednesday, January 12, 2022 4:59 PM
> > > >
> > > > > Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when
> > > > > kernel boots with parameters affecting NUMA topology
> > > > >
> > > > > From: Long Li <longli@xxxxxxxxxxxxx> Sent: Friday, January 7, 2022 12:32 PM
> > > > > > >
> > > > > > > From: longli@xxxxxxxxxxxxxxxxx <longli@xxxxxxxxxxxxxxxxx> Sent: Thursday, January 6, 2022 3:20 PM
> > > > > > > >
> > > > > > > > When the kernel boots with parameters restricting the
> > > > > > > > number of cpus or NUMA nodes, e.g. maxcpus=X or numa=off,
> > > > > > > > the vPCI driver should only set the NUMA node to a value
> > > > > > > > that is valid in the current running kernel.
> > > > > > > >
> > > > > > > > Signed-off-by: Long Li <longli@xxxxxxxxxxxxx>
> > > > > > > > ---
> > > > > > > >  drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > > > > > > index fc1a29acadbb..8686343eff4c 100644
> > > > > > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > > > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > > > > > @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> > > > > > > >  		if (!hv_dev)
> > > > > > > >  			continue;
> > > > > > > >
> > > > > > > > -		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > > > > > > -			set_dev_node(&dev->dev, hv_dev->desc.virtual_numa_node);
> > > > > > > > +		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > > > > > > +			int cpu;
> > > > > > > > +			bool found_node = false;
> > > > > > > > +
> > > > > > > > +			for_each_possible_cpu(cpu)
> > > > > > > > +				if (cpu_to_node(cpu) ==
> > > > > > > > +				    hv_dev->desc.virtual_numa_node) {
> > > > > > > > +					found_node = true;
> > > > > > > > +					break;
> > > > > > > > +				}
> > > > > > > > +
> > > > > > > > +			if (found_node)
> > > > > > > > +				set_dev_node(&dev->dev,
> > > > > > > > +					     hv_dev->desc.virtual_numa_node);
> > > > > > > > +		}
> > > > > > >
> > > > > > > I'm wondering about this approach vs. just comparing against
> > > > > > > nr_node_ids.
> > > > > >
> > > > > > I was trying to fix this by comparing with nr_node_ids. This
> > > > > > worked for numa=off, but it didn't work with maxcpus=X.
> > > > > >
> > > > > > maxcpus=X is commonly used in kdump kernels. In this config,
> > > > > > the memory system is initialized in a way that only the NUMA
> > > > > > nodes within maxcpus are set up and can be used by the drivers.
> > > > >
> > > > > In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA
> > > > > nodes, the number of NUMA nodes configured in the kernel is not
> > > > > affected by maxcpus= on the kernel boot line. This VM has 48
> > > > > vCPUs and 2 NUMA nodes, and is Generation 2.
> > > > > Even with maxcpus=4 or maxcpus=1, these lines are output during boot:
> > > > >
> > > > > [ 0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
> > > > > [ 0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]
> > > > >
> > > > > and
> > > > >
> > > > > [ 0.280039] Initmem setup node 0 [mem 0x0000000000001000-0x0000007edfffffff]
> > > > > [ 0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-0x000000fcdfffffff]
> > > > >
> > > > > It's perfectly legit to have a NUMA node with memory but no CPUs.
> > > > > The memory assigned to the NUMA node is determined by the ACPI SRAT.
> > > > > So I'm wondering what is causing the kdump issue you see. Or
> > > > > maybe the behavior of older kernels is different.
> > > >
> > > > Sorry, it turns out I had a typo. It's nr_cpus=1 (not maxcpus).
> > > > But I'm not sure if that matters, as the descriptions of these two
> > > > in the kernel doc are the same.
> > > >
> > > > On my system (4 NUMA nodes) with this kdump boot line (maybe if you
> > > > try a VM with 4 NUMA nodes, you can see the problem):
> > > > [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-1025-azure
> > > > root=PARTUUID=7145c36d-e182-43b6-a37e-0b6d18fef8fe ro console=tty1
> > > > console=ttyS0 earlyprintk=ttyS0 reset_devices
> > > > systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb
> > > > ata_piix.prefer_ms_hyperv=0 elfcorehdr=4038049140K
> > > >
> > > > I see the following:
> > > > [ 0.408246] NODE_DATA(0) allocated [mem 0x2cfd6000-0x2cffffff]
> > > > [ 0.410454] NODE_DATA(3) allocated [mem 0x3c2bef32000-0x3c2bef5bfff]
> > > > [ 0.413031] Zone ranges:
> > > > [ 0.414117]   DMA     [mem 0x0000000000001000-0x0000000000ffffff]
> > > > [ 0.416522]   DMA32   [mem 0x0000000001000000-0x00000000ffffffff]
> > > > [ 0.418932]   Normal  [mem 0x0000000100000000-0x000003c2bef5cfff]
> > > > [ 0.421357]   Device  empty
> > > > [ 0.422454] Movable zone start for each node
> > > > [ 0.424109] Early memory node ranges
> > > > [ 0.425541]   node 0: [mem 0x0000000000001000-0x000000000009ffff]
> > > > [ 0.428050]   node 0: [mem 0x000000001d000000-0x000000002cffffff]
> > > > [ 0.430547]   node 3: [mem 0x000003c27f000000-0x000003c2bef5cfff]
> > > > [ 0.432963] Initmem setup node 0 [mem 0x0000000000001000-0x000000002cffffff]
> > > > [ 0.435695] Initmem setup node 3 [mem 0x000003c27f000000-0x000003c2bef5cfff]
> > > > [ 0.438446] On node 0, zone DMA: 1 pages in unavailable ranges
> > > > [ 0.439377] On node 0, zone DMA32: 53088 pages in unavailable ranges
> > > > [ 0.452784] On node 3, zone Normal: 40960 pages in unavailable ranges
> > > > [ 0.455221] On node 3, zone Normal: 4259 pages in unavailable ranges
> > > >
> > > > It's unclear to me why nodes 1 and 2 are missing. But I don't think
> > > > it's a Hyper-V problem, since it's only affected by setting nr_cpus
> > > > on the kernel boot line. Later, a device driver (mlx5 in this
> > > > example) tries to allocate memory on node 1 and fails:
> > >
> > > To summarize some offline conversation, we've figured out that the
> > > "missing" NUMA nodes are not due to setting maxcpus=1 or nr_cpus=1.
> > > Setting the cpu count doesn't affect any of this.
> > >
> > > Instead, Linux is modifying the memory map prior to starting the
> > > kdump kernel so that most of the memory is not touched and is
> > > preserved to be dumped, which is the whole point of kdump.
> > > This modified memory map has no memory in NUMA nodes 1 and 2, so it
> > > is correct to just see nodes 0 and 3 as online.
> > >
> > > I think the code fix here is pretty simple:
> > >
> > >	int node;
> > >
> > >	node = hv_dev->desc.virtual_numa_node;
> > >	if ((hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > >			&& (node < nr_node_ids))
> > >		set_dev_node(&dev->dev, numa_map_to_online_node(node));
> > >
> > > Michael
> >
> > Okay, this looks good.
> >
> > I'm sending a V2 (with a minor change) after testing is done.
> >
> > Long
>
> Please leave a comment in the code as to why a NUMA node might be
> offline. In the future, somebody new might not know what can happen.
> I certainly didn't. :-(

Will do that.

Long
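
For reference, a minimal sketch of how the check suggested above might read once the requested comment is added. The comment wording is illustrative only and this is not the actual V2 patch; hv_dev, dev, nr_node_ids, set_dev_node() and numa_map_to_online_node() are the same symbols already used in the thread:

	int node = hv_dev->desc.virtual_numa_node;

	/*
	 * The node reported by Hyper-V may not be usable as-is: with
	 * numa=off the reported id can be out of range, and a kdump
	 * kernel boots with a modified memory map that can leave some
	 * NUMA nodes without memory and therefore offline. Only honor
	 * a valid node id, and map it to a node that is online.
	 */
	if ((hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) &&
	    (node < nr_node_ids))
		set_dev_node(&dev->dev, numa_map_to_online_node(node));

Skipping set_dev_node() when the reported id is out of range simply leaves the device's existing NUMA node setting untouched.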