RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

Long Li <longli@xxxxxxxxxxxxx> · Thu, 13 Jan 2022 00:59:13 +0000

> Subject: RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots
> with parameters affecting NUMA topology
> 
> From: Long Li <longli@xxxxxxxxxxxxx> Sent: Friday, January 7, 2022 12:32 PM
> > >
> > > From: longli@xxxxxxxxxxxxxxxxx <longli@xxxxxxxxxxxxxxxxx> Sent:
> > > Thursday, January 6, 2022 3:20 PM
> > > >
> > > > When the kernel boots with parameters restricting the number of
> > > > cpus or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI driver
> > > > should only set to the NUMA node to a value that is valid in the current
> running kernel.
> > > >
> > > > Signed-off-by: Long Li <longli@xxxxxxxxxxxxx>
> > > > ---
> > > >  drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > > b/drivers/pci/controller/pci- hyperv.c index
> > > > fc1a29acadbb..8686343eff4c 100644
> > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct
> > > > hv_pcibus_device *hbus)
> > > >  		if (!hv_dev)
> > > >  			continue;
> > > >
> > > > -		if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > > -			set_dev_node(&dev->dev, hv_dev-
> >desc.virtual_numa_node);
> > > > +		if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > > +			int cpu;
> > > > +			bool found_node = false;
> > > > +
> > > > +			for_each_possible_cpu(cpu)
> > > > +				if (cpu_to_node(cpu) ==
> > > > +				    hv_dev->desc.virtual_numa_node) {
> > > > +					found_node = true;
> > > > +					break;
> > > > +				}
> > > > +
> > > > +			if (found_node)
> > > > +				set_dev_node(&dev->dev,
> > > > +					     hv_dev->desc.virtual_numa_node);
> > > > +		}
> > >
> > > I'm wondering about this approach vs. just comparing against nr_node_ids.
> >
> > I was trying to fix this by comparing with nr_node_ids. This worked
> > for numa=off, but it didn't work with maxcpus=X.
> >
> > maxcpus=X is commonly used in kdump kernels. In this config,  the
> > memory system is initialized in a way that only the NUMA nodes within
> > maxcpus are setup and can be used by the drivers.
> 
> In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA nodes, the
> number of NUMA nodes configured in the kernel is not affected by maxcpus= on
> the kernel boot line.  This VM has 48 vCPUs and 2 NUMA nodes, and is
> Generation 2.  Even with maxcpus=4 or maxcpus=1, these lines are output during
> boot:
> 
> [    0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
> [    0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]
> 
> and
> 
> [    0.280039] Initmem setup node 0 [mem 0x0000000000001000-
> 0x0000007edfffffff]
> [    0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-
> 0x000000fcdfffffff]
> 
> It's perfectly legit to have a NUMA node with memory but no CPUs.  The
> memory assigned to the NUMA node is determined by the ACPI SRAT.  So I'm
> wondering what is causing the kdump issue you see.  Or maybe the behavior of
> older kernels is different.

Sorry, it turns out I had a typo. It's nr_cpus=1 (not maxcpus). But I'm not sure if that matters as the descriptions on these two in the kernel doc are the same.

On my system (4 NUMA nodes) with kdump boot line:  (maybe if you try a VM with 4 NUMA nodes, you can see the problem)
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-1025-azure root=PARTUUID=7145c36d-e182-43b6-a37e-0b6d18fef8fe ro console=tty1 console=ttyS0 earlyprintk=ttyS0 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0 elfcorehdr=4038049140K

I see the following:
[    0.408246] NODE_DATA(0) allocated [mem 0x2cfd6000-0x2cffffff]
[    0.410454] NODE_DATA(3) allocated [mem 0x3c2bef32000-0x3c2bef5bfff]
[    0.413031] Zone ranges:
[    0.414117]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.416522]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.418932]   Normal   [mem 0x0000000100000000-0x000003c2bef5cfff]
[    0.421357]   Device   empty
[    0.422454] Movable zone start for each node
[    0.424109] Early memory node ranges
[    0.425541]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
[    0.428050]   node   0: [mem 0x000000001d000000-0x000000002cffffff]
[    0.430547]   node   3: [mem 0x000003c27f000000-0x000003c2bef5cfff]
[    0.432963] Initmem setup node 0 [mem 0x0000000000001000-0x000000002cffffff]
[    0.435695] Initmem setup node 3 [mem 0x000003c27f000000-0x000003c2bef5cfff]
[    0.438446] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.439377] On node 0, zone DMA32: 53088 pages in unavailable ranges
[    0.452784] On node 3, zone Normal: 40960 pages in unavailable ranges
[    0.455221] On node 3, zone Normal: 4259 pages in unavailable ranges

It's unclear to me why node 1 and 2 are missing. But I don't think it's a Hyper-V problem since it's only affected by setting nr_cpus over kernel boot line. Later, a device driver (mlx5 in this example) tries to allocate memory on node 1 and fails:

[  137.348836] BUG: unable to handle page fault for address: 0000000000001cc8
[  137.352224] #PF: supervisor read access in kernel mode
[  137.354884] #PF: error_code(0x0000) - not-present page
[  137.357440] PGD 0 P4D 0 
[  137.358976] Oops: 0000 [#1] SMP NOPTI
[  137.361072] CPU: 0 PID: 445 Comm: systemd-udevd Not tainted 5.11.0-1025-azure #27~20.04.1-Ubuntu
[  137.365160] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 10/27/2020
[  137.369915] RIP: 0010:__alloc_pages_nodemask+0x10e/0x300
[  137.372590] Code: ff ff 84 c0 0f 85 21 01 00 00 44 89 e0 48 8b 55 b0 8b 75 c4 c1 e8 0c 48 8b 7d a8 83 e0 01 88 45 c8 48 85 d2 0f 85 44 01 00 00 <3b> 77 08 0f 82 3b 01 00 00 48 89 7d b8 48 8b 07 44 89 e2 81 e2 00
[  137.381311] RSP: 0018:ffffabcd4137b858 EFLAGS: 00010246
[  137.383941] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  137.387397] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000001cc0
[  137.390724] RBP: ffffabcd4137b8b0 R08: ffffffffffffffff R09: ffffffffffffffff
[  137.394347] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000cc0
[  137.397698] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc0
[  137.401055] FS:  00007fefb6620880(0000) GS:ffff90103ec00000(0000) knlGS:0000000000000000
[  137.404970] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  137.407888] CR2: 0000000000001cc8 CR3: 000003c280d78000 CR4: 0000000000350ef0
[  137.411294] Call Trace:
[  137.412882]  __dma_direct_alloc_pages+0x8e/0x120
[  137.415399]  dma_direct_alloc+0x66/0x2a0
[  137.417517]  dma_alloc_attrs+0x3e/0x50
[  137.419590]  mlx5_cmd_init+0xe5/0x570 [mlx5_core]

> 
> >
> > > Comparing against nr_node_ids would handle the case of numa=off on
> > > the kernel boot line, or a kernel built with CONFIG_NUMA=n, or the
> > > use of numa=fake.  Your approach is also affected by which CPUs are
> > > online, since
> > > cpu_to_node() references percpu data.  It would seem to produce more
> > > variable results since CPUs can go online and offline while the VM is running.
> > > If a network VF device was removed and re-added, the results of your
> > > algorithm could be different for the re-add, depending on which CPUs
> > > were online at the time.
> > >
> > > My impression (which may be incorrect) is that the device numa_node
> > > is primarily to allow the driver to allocate memory from the closest
> > > NUMA node, and such memory allocations don't really need to be
> > > affected by which CPUs are online.
> >
> > Yes, this is the reason I'm using for_each_possible_cpu(). Even if
> > some CPUs are not online, the memory system is setup in a way that
> > allow driver to allocate memory on that NUMA node. The algorithm
> > guarantees the value of NUMA node is valid when calling set_dev_node().
> >
> 
> I'm thinking the code here should check against nr_node_ids, to catch the
> numa=off  or CONFIG_NUMA=n cases.  Then could use either node_online() or
> numa_map_to_online_node(), but I'm still curious as to how we would get an
> offline NUMA node given how Hyper-V normally sets up a VM.
> 
> NUMA nodes only transition from online to offline if there are no CPUs or
> memory assigned.  That can happen if the CPUs are taken offline (or never came
> online) and if the memory is hot-removed.  We don't currently support hot-
> remove memory in Hyper-V VMs, though there has been some discussion about
> adding it.  I'm not sure how that case is supposed to be handled if the NUMA
> node is stashed in some device and get used during dma_alloc_coherent(), for
> example.  That seems to be a general Linux problem unless there's a mechanism
> for handling it that I haven't noticed.
> 
> Michael