RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

"Michael Kelley (LINUX)" <mikelley@xxxxxxxxxxxxx> · Sun, 16 Jan 2022 21:18:11 +0000

From: Long Li <longli@xxxxxxxxxxxxx> Sent: Wednesday, January 12, 2022 4:59 PM
>
> > Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots
> > with parameters affecting NUMA topology
> >
> > From: Long Li <longli@xxxxxxxxxxxxx> Sent: Friday, January 7, 2022 12:32 PM
> > > >
> > > > From: longli@xxxxxxxxxxxxxxxxx <longli@xxxxxxxxxxxxxxxxx> Sent:
> > > > Thursday, January 6, 2022 3:20 PM
> > > > >
> > > > > When the kernel boots with parameters restricting the number of
> > > > > cpus or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI driver
> > > > > should only set to the NUMA node to a value that is valid in the current running kernel.
> > > > >
> > > > > Signed-off-by: Long Li <longli@xxxxxxxxxxxxx>
> > > > > ---
> > > > >  drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > > > b/drivers/pci/controller/pci- hyperv.c index
> > > > > fc1a29acadbb..8686343eff4c 100644
> > > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > > @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct
> > > > > hv_pcibus_device *hbus)
> > > > >  		if (!hv_dev)
> > > > >  			continue;
> > > > >
> > > > > -		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > > > -			set_dev_node(&dev->dev, hv_dev->desc.virtual_numa_node);
> > > > > +		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > > > +			int cpu;
> > > > > +			bool found_node = false;
> > > > > +
> > > > > +			for_each_possible_cpu(cpu)
> > > > > +				if (cpu_to_node(cpu) ==
> > > > > +				    hv_dev->desc.virtual_numa_node) {
> > > > > +					found_node = true;
> > > > > +					break;
> > > > > +				}
> > > > > +
> > > > > +			if (found_node)
> > > > > +				set_dev_node(&dev->dev,
> > > > > +					     hv_dev->desc.virtual_numa_node);
> > > > > +		}
> > > >
> > > > I'm wondering about this approach vs. just comparing against nr_node_ids.
> > >
> > > I was trying to fix this by comparing with nr_node_ids. This worked
> > > for numa=off, but it didn't work with maxcpus=X.
> > >
> > > maxcpus=X is commonly used in kdump kernels. In this config,  the
> > > memory system is initialized in a way that only the NUMA nodes within
> > > maxcpus are setup and can be used by the drivers.
> >
> > In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA nodes, the
> > number of NUMA nodes configured in the kernel is not affected by maxcpus= on
> > the kernel boot line.  This VM has 48 vCPUs and 2 NUMA nodes, and is
> > Generation 2.  Even with maxcpus=4 or maxcpus=1, these lines are output during
> > boot:
> >
> > [    0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
> > [    0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]
> >
> > and
> >
> > [    0.280039] Initmem setup node 0 [mem 0x0000000000001000-0x0000007edfffffff]
> > [    0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-0x000000fcdfffffff]
> >
> > It's perfectly legit to have a NUMA node with memory but no CPUs.  The
> > memory assigned to the NUMA node is determined by the ACPI SRAT.  So I'm
> > wondering what is causing the kdump issue you see.  Or maybe the behavior of
> > older kernels is different.
> 
> Sorry, it turns out I had a typo. It's nr_cpus=1 (not maxcpus). But I'm not sure if that
> matters as the descriptions on these two in the kernel doc are the same.
> 
> On my system (4 NUMA nodes) with kdump boot line:  (maybe if you try a VM with 4
> NUMA nodes, you can see the problem)
> [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-1025-azure
> root=PARTUUID=7145c36d-e182-43b6-a37e-0b6d18fef8fe ro console=tty1 console=ttyS0
> earlyprintk=ttyS0 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1
> irqpoll nousb ata_piix.prefer_ms_hyperv=0 elfcorehdr=4038049140K
> 
> I see the following:
> [    0.408246] NODE_DATA(0) allocated [mem 0x2cfd6000-0x2cffffff]
> [    0.410454] NODE_DATA(3) allocated [mem 0x3c2bef32000-0x3c2bef5bfff]
> [    0.413031] Zone ranges:
> [    0.414117]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [    0.416522]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> [    0.418932]   Normal   [mem 0x0000000100000000-0x000003c2bef5cfff]
> [    0.421357]   Device   empty
> [    0.422454] Movable zone start for each node
> [    0.424109] Early memory node ranges
> [    0.425541]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
> [    0.428050]   node   0: [mem 0x000000001d000000-0x000000002cffffff]
> [    0.430547]   node   3: [mem 0x000003c27f000000-0x000003c2bef5cfff]
> [    0.432963] Initmem setup node 0 [mem 0x0000000000001000-0x000000002cffffff]
> [    0.435695] Initmem setup node 3 [mem 0x000003c27f000000-0x000003c2bef5cfff]
> [    0.438446] On node 0, zone DMA: 1 pages in unavailable ranges
> [    0.439377] On node 0, zone DMA32: 53088 pages in unavailable ranges
> [    0.452784] On node 3, zone Normal: 40960 pages in unavailable ranges
> [    0.455221] On node 3, zone Normal: 4259 pages in unavailable ranges
> 
> It's unclear to me why node 1 and 2 are missing. But I don't think it's a Hyper-V problem
> since it's only affected by setting nr_cpus over kernel boot line. Later, a device driver
> (mlx5 in this example) tries to allocate memory on node 1 and fails:
> 

To summarize some offline conversation, we've figured out that
the "missing" NUMA nodes are not due to setting maxcpus=1 or
nr_cpus=1.  Setting the cpu count doesn't affect any of this.

Instead, Linux is modifying the memory map prior to starting the
kdump kernel so that most of the memory is not touched and is
preserved to be dumped, which is the whole point of kdump.   This
modified memory map has no memory in NUMA nodes 1 and 2, so
it is correct to just see nodes 0 and 3 as online.

I think code fix here is pretty simple:

	int node;

	node = hv_dev->desc.virtual_numa_node;
	if ((hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
			&& (node < nr_node_ids))
		set_dev_node(&dev->dev, numa_map_to_online_node(node));

Michael