Re: [fake numa not working] [PATCH 1/1] mm/fake-numa: allow later numa node hotplug

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Sorry for my late answer, I was out traveling.

Le 20/01/2025 à 04:27, Zi Yan a écrit :
On Sun Jan 19, 2025 at 9:37 PM EST, Hyeonggon Yoo wrote:


On 1/20/2025 11:16 AM, Zi Yan wrote:
On Sun Jan 19, 2025 at 7:37 PM EST, Hyeonggon Yoo wrote:


On 1/18/2025 4:09 AM, Zi Yan wrote:
On 17 Jan 2025, at 0:00, Hyeonggon Yoo wrote:

On 1/6/2025 9:06 PM, Bruno Faccini wrote:
Current fake-numa implementation prevents new Numa
nodes to be later hot-plugged by drivers.
A common symptom of this limitation is the
"node <X> was absent from the node_possible_map"
message by associated warning in mm/memory_hotplug.c:
add_memory_resource().
This comes from the lack of remapping in both
pxm_to_node_map[] and node_to_pxm_map[] tables
to take fake-numa nodes into account and thus
triggers collisions with original and physical nodes
only-mapping that had been determined from BIOS tables.
This patch fixes this by doing the necessary node-ids
translation in both pxm_to_node_map[]/node_to_pxm_map[]
tables.
node_distance[] table has also been fixed accordingly.

Signed-off-by: Bruno Faccini <bfaccini@xxxxxxxxxx>

Hi Bruno,

This commit causes WARN() and disables fakenuma
when I pass "numa=fake=4U" on my QEMU setup.

Attaching WARN() log and git bisect log to help debugging.

Is your VM getting 4 NUMA nodes at the end?

No, it only gets node 0 at the end.

Got it.

Yes, it is expected that if something fails, we get back to the orig config.


Can you also share your QEMU command line and your kernel config?

The config is attached (fakenuma.config)

QEMU command line:
$ qemu-system-x86_64 \
           -cpu host \
           -enable-kvm \
           -smp $(nproc) \
           -m 32G \
           -nographic \
           -kernel arch/x86/boot/bzImage \
           -initrd ../../rootfs.cpio.gz \
           -nic user,model=virtio-net-pci \
           -append "console=ttyS0 numa=fake=4U"

This can help Bruno reproduce the issue locally. At least, I
cannot reproduce it on mm-everything-2025-01-16-23-18.

Still reproduced on mm-everything-2025-01-18-04-55 with my setup.

I also
notice that my VM does not have “No NUMA configuration found”
(in your log) in its kernel log and that might explain
why I could not reproduce it.

Did you do NUMA configuration (-numa in QEMU) for your VM?
I didn't, but it's still expected to work 'cause it's fake numa.

No. It seems that the "No NUMA configuration found" is caused by your
qemu, where dummy_numa_init() is used (maybe ACPI SRAT is not
populated?)

FYI, Yes. ACPI SRAT is not populated.

$ qemu-system-x86_64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.24)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

Basically, fix_pxm_node_maps() breaks fakenuma when it cannot fix the
pxm_to_node_map() due to ACPI SRAT is missing. IMHO, it should let
fakenuma proceed with a warning. The diff below fixed the issue
locally, but Bruno can tell if I miss anything.
Right, I did not anticipate that SRAT information is not provided ...


diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 59fffe34c9d0..18d04f054e93 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -95,9 +95,12 @@ int __init fix_pxm_node_maps(int max_nid)
  	int i, j, index = -1, count = 0;
  	nodemask_t nodes_to_enable;
- if (numa_off || srat_disabled())
+	if (numa_off)
  		return -1;
+ if (srat_disabled())
+		return 0;
+
  	/* find fake nodes PXM mapping */
  	for (i = 0; i < MAX_NUMNODES; i++) {
  		if (node_to_pxm_map[i] != PXM_INVAL) {
@@ -117,9 +120,9 @@ int __init fix_pxm_node_maps(int max_nid)
  			}
  		}
  	}
-	if (WARN(index != max_nid, "%d max nid  when expected %d\n",
-		      index, max_nid))
-		return -1;
+	/* No srat data */
+	if (index != max_nid)
+		return 0;
Yes, I also believe that removing the warning and return is the way to handle missing SRAT informations case.

I will push a patch update later today.

Many thanks to have done the necessary work to identify/fix this guys !!!

  	nodes_clear(nodes_to_enable);

I used the patch below to emulate the situation and
reproduced the issue. Since dummy_numa_init() is used when there is
no underlying NUMA architecture, NUMA initialization fails, or NUMA
is disabled on the command line (the last does not apply here),
it should be properly handled. I will let Bruno look into it.

Thank you for the info.

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 64e5cdb2460a..70dfd9bfb23f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -223,7 +223,7 @@ static int __init dummy_numa_init(void)
    */
   void __init x86_numa_init(void)
   {
-	if (!numa_off) {
+	if (0 && !numa_off) {
   #ifdef CONFIG_ACPI_NUMA
   		if (!numa_init(x86_acpi_numa_init))
   			return;


[WARN() splat]

[    0.009676] No NUMA configuration found
[    0.009677] Faking a node at [mem 0x0000000000000000-0x000000083fffffff]
[    0.009705] Fake node size 16895MB too small, increasing to 16896MB
[    0.009708] Faking node 0 at [mem 0x0000000000001000-0x0000000420000fff] (16896MB)
[    0.009711] Faking node 1 at [mem 0x0000000420001000-0x000000083fffffff] (16895MB)
[    0.009738] ------------[ cut here ]------------
[    0.009739] -1 max nid  when expected 1
[    0.009760] WARNING: CPU: 0 PID: 0 at drivers/acpi/numa/srat.c:120 fix_pxm_node_maps+0x577/0x810
[    0.009766] Modules linked in:
[    0.009769] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.13.0-rc6+ #31
[    0.009771] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[    0.009772] RIP: 0010:fix_pxm_node_maps+0x577/0x810
[    0.009774] Code: 8b 85 40 ff ff ff 44 8b 8d 44 ff ff ff 8b 85 48 ff ff ff e9 83 fb ff ff 44 89 f2 44 89 ce 48 c7 c7 aa 2f 84 8d e8 b9 71 01 fe <0f> 0b 41 b8 ff ff ff ff e9 ba fb ff ff 48 c7 c7 a0 d8 d1 8d 44 89
[    0.009776] RSP: 0000:ffffffff8da03be0 EFLAGS: 00010046 ORIG_RAX: 0000000000000000
[    0.009778] RAX: 0000000000000000 RBX: 0000000000000400 RCX: 0000000000000000
[    0.009779] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    0.009780] RBP: ffffffff8da03ca8 R08: 0000000000000000 R09: 0000000000000000
[    0.009781] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[    0.009782] R13: 00000000000007ff R14: 0000000000000001 R15: 0000000000000000
[    0.009783] FS:  0000000000000000(0000) GS:ffffffff8de8c000(0000) knlGS:0000000000000000
[    0.009784] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.009785] CR2: ffff90e4bde01000 CR3: 00000001bd03e000 CR4: 00000000000200b0
[    0.009788] Call Trace:
[    0.009790]  <TASK>
[    0.009792]  ? show_regs+0x71/0x90
[    0.009797]  ? __warn+0x8d/0x150
[    0.009799]  ? fix_pxm_node_maps+0x577/0x810
[    0.009802]  ? report_bug+0x1ab/0x1c0
[    0.009805]  ? fixup_exception+0x27/0x380
[    0.009809]  ? early_fixup_exception+0xa2/0xf0
[    0.009813]  ? do_early_exception+0x28/0x90
[    0.009816]  ? early_idt_handler_common+0x2f/0x3a
[    0.009819]  ? fix_pxm_node_maps+0x577/0x810
[    0.009822]  ? numa_cleanup_meminfo+0x8c/0x5b0
[    0.009828]  numa_emulation+0x4e3/0xad0
[    0.009830]  ? __pfx_dummy_numa_init+0x10/0x10
[    0.009833]  ? __pfx_dummy_numa_init+0x10/0x10
[    0.009836]  numa_memblks_init+0x10c/0x2c0
[    0.009838]  ? __pfx_dummy_numa_init+0x10/0x10
[    0.009840]  numa_init+0x61/0x3e0
[    0.009842]  ? topology_register_boot_apic+0x2a/0x40
[    0.009846]  x86_numa_init+0x65/0x80
[    0.009848]  initmem_init+0xe/0x20
[    0.009850]  setup_arch+0x9d8/0xfa0
[    0.009852]  ? _printk+0x58/0x90
[    0.009855]  start_kernel+0x5f/0xb50
[    0.009860]  x86_64_start_reservations+0x18/0x30
[    0.009863]  x86_64_start_kernel+0xc0/0x110
[    0.009864]  ? setup_ghcb+0xe/0x140
[    0.009867]  common_startup_64+0x13e/0x141
[    0.009871]  </TASK>
[    0.009871] ---[ end trace 0000000000000000 ]---

[Git bisect log]

# bad: [f378252a2168c2fbf8fc08b635061e5f6748c1f2] kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
git bisect bad f378252a2168c2fbf8fc08b635061e5f6748c1f2
# good: [cbc5dde0a461240046e8a41c43d7c3b76d5db952] fs/proc: fix softlockup in __read_vmcore (part 2)
git bisect good cbc5dde0a461240046e8a41c43d7c3b76d5db952
# good: [fb73203263021f2ee9dd54d280b1c543d10acd76] mm: move common part of pagetable_*_ctor to helper
git bisect good fb73203263021f2ee9dd54d280b1c543d10acd76
# bad: [8fd966c9310f14d5bbe653cf087853582170aad6] mm, swap: remove old allocation path for HDD
git bisect bad 8fd966c9310f14d5bbe653cf087853582170aad6
# bad: [8646971f02651146554233d63fdd721f5f060973] mm: shmem: skip swapcache for swapin of synchronous swap device
git bisect bad 8646971f02651146554233d63fdd721f5f060973
# good: [b8ba614eaa8dd31702c61e7df7231b1b18b99259] mm/damon/paddr: report filter-passed bytes back for DAMOS_STAT action
git bisect good b8ba614eaa8dd31702c61e7df7231b1b18b99259
# good: [fd0935b8e9e8c2edf05a3d0ffa09d0aa3e9cf2dd] Docs/ABI/damon: document per-region DAMOS filter-passed bytes stat file
git bisect good fd0935b8e9e8c2edf05a3d0ffa09d0aa3e9cf2dd
# good: [3ffd3cd7e2e8793d3b5fbb83c03668c47ab6d599] selftests/damon: remove tests for DAMON debugfs interface
git bisect good 3ffd3cd7e2e8793d3b5fbb83c03668c47ab6d599
# good: [bf9f93f0b7bff8af780330f5094775a39caf3612] mm/damon: remove DAMON debugfs interface
git bisect good bf9f93f0b7bff8af780330f5094775a39caf3612
# bad: [1746744be6ff271e29345a5be4cc144aa33b10ab] mm/memmap: prevent double scanning of memmap by kmemleak
git bisect bad 1746744be6ff271e29345a5be4cc144aa33b10ab
# bad: [ca19a5ab8756ef78c34d02d2b8c266a9cc7bf57d] mm/fake-numa: allow later numa node hotplug
git bisect bad ca19a5ab8756ef78c34d02d2b8c266a9cc7bf57d
# first bad commit: [ca19a5ab8756ef78c34d02d2b8c266a9cc7bf57d] mm/fake-numa: allow later numa node hotplug

---
     drivers/acpi/numa/srat.c     | 86 ++++++++++++++++++++++++++++++++++++
     include/acpi/acpi_numa.h     |  5 +++
     include/linux/numa_memblks.h |  3 ++
     mm/numa_emulation.c          | 45 ++++++++++++++++---
     mm/numa_memblks.c            |  2 +-
     5 files changed, 133 insertions(+), 8 deletions(-)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index bec0dcd1f9c3..59fffe34c9d0 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -81,6 +81,92 @@ int acpi_map_pxm_to_node(int pxm)
     }
     EXPORT_SYMBOL(acpi_map_pxm_to_node);
    +#ifdef CONFIG_NUMA_EMU
+/*
+ * Take max_nid - 1 fake-numa nodes into account in both
+ * pxm_to_node_map()/node_to_pxm_map[] tables.
+ */
+int __init fix_pxm_node_maps(int max_nid)
+{
+	static int pxm_to_node_map_copy[MAX_PXM_DOMAINS] __initdata
+			= { [0 ... MAX_PXM_DOMAINS - 1] = NUMA_NO_NODE };
+	static int node_to_pxm_map_copy[MAX_NUMNODES] __initdata
+			= { [0 ... MAX_NUMNODES - 1] = PXM_INVAL };
+	int i, j, index = -1, count = 0;
+	nodemask_t nodes_to_enable;
+
+	if (numa_off || srat_disabled())
+		return -1;
+
+	/* find fake nodes PXM mapping */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (node_to_pxm_map[i] != PXM_INVAL) {
+			for (j = 0; j <= max_nid; j++) {
+				if ((emu_nid_to_phys[j] == i) &&
+				    WARN(node_to_pxm_map_copy[j] != PXM_INVAL,
+					 "Node %d is already binded to PXM %d\n",
+					 j, node_to_pxm_map_copy[j]))
+					return -1;
+				if (emu_nid_to_phys[j] == i) {
+					node_to_pxm_map_copy[j] =
+						node_to_pxm_map[i];
+					if (j > index)
+						index = j;
+					count++;
+				}
+			}
+		}
+	}
+	if (WARN(index != max_nid, "%d max nid  when expected %d\n",
+		      index, max_nid))
+		return -1;
+
+	nodes_clear(nodes_to_enable);
+
+	/* map phys nodes not used for fake nodes */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (node_to_pxm_map[i] != PXM_INVAL) {
+			for (j = 0; j <= max_nid; j++)
+				if (emu_nid_to_phys[j] == i)
+					break;
+			/* fake nodes PXM mapping has been done */
+			if (j <= max_nid)
+				continue;
+			/* find first hole */
+			for (j = 0;
+			     j < MAX_NUMNODES &&
+				 node_to_pxm_map_copy[j] != PXM_INVAL;
+			     j++)
+			;
+			if (WARN(j == MAX_NUMNODES,
+			    "Number of nodes exceeds MAX_NUMNODES\n"))
+				return -1;
+			node_to_pxm_map_copy[j] = node_to_pxm_map[i];
+			node_set(j, nodes_to_enable);
+			count++;
+		}
+	}
+
+	/* creating reverse mapping in pxm_to_node_map[] */
+	for (i = 0; i < MAX_NUMNODES; i++)
+		if (node_to_pxm_map_copy[i] != PXM_INVAL &&
+		    pxm_to_node_map_copy[node_to_pxm_map_copy[i]] == NUMA_NO_NODE)
+			pxm_to_node_map_copy[node_to_pxm_map_copy[i]] = i;
+
+	/* overwrite with new mapping */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		node_to_pxm_map[i] = node_to_pxm_map_copy[i];
+		pxm_to_node_map[i] = pxm_to_node_map_copy[i];
+	}
+
+	/* enable other nodes found in PXM for hotplug */
+	nodes_or(numa_nodes_parsed, nodes_to_enable, numa_nodes_parsed);
+
+	pr_debug("found %d total number of nodes\n", count);
+	return 0;
+}
+#endif
+
     static void __init
     acpi_table_print_srat_entry(struct acpi_subtable_header *header)
     {
diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h
index b5f594754a9e..99b960bd473c 100644
--- a/include/acpi/acpi_numa.h
+++ b/include/acpi/acpi_numa.h
@@ -17,11 +17,16 @@ extern int node_to_pxm(int);
     extern int acpi_map_pxm_to_node(int);
     extern unsigned char acpi_srat_revision;
     extern void disable_srat(void);
+extern int fix_pxm_node_maps(int max_nid);
      extern void bad_srat(void);
     extern int srat_disabled(void);
      #else				/* CONFIG_ACPI_NUMA */
+static inline int fix_pxm_node_maps(int max_nid)
+{
+	return 0;
+}
     static inline void disable_srat(void)
     {
     }
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index cfad6ce7e1bd..dd85613cdd86 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -29,7 +29,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
     int __init numa_memblks_init(int (*init_func)(void),
     			     bool memblock_force_top_down);
    +extern int numa_distance_cnt;
+
     #ifdef CONFIG_NUMA_EMU
+extern int emu_nid_to_phys[MAX_NUMNODES];
     int numa_emu_cmdline(char *str);
     void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
     					unsigned int nr_emu_nids);
diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
index 031fb9961bf7..9d55679d99ce 100644
--- a/mm/numa_emulation.c
+++ b/mm/numa_emulation.c
@@ -8,11 +8,12 @@
     #include <linux/memblock.h>
     #include <linux/numa_memblks.h>
     #include <asm/numa.h>
+#include <acpi/acpi_numa.h>
      #define FAKE_NODE_MIN_SIZE	((u64)32 << 20)
     #define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
    -static int emu_nid_to_phys[MAX_NUMNODES];
+int emu_nid_to_phys[MAX_NUMNODES];
     static char *emu_cmdline __initdata;
      int __init numa_emu_cmdline(char *str)
@@ -379,6 +380,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
     	size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
     	int max_emu_nid, dfl_phys_nid;
     	int i, j, ret;
+	nodemask_t physnode_mask = numa_nodes_parsed;
      	if (!emu_cmdline)
     		goto no_emu;
@@ -395,7 +397,6 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
     	 * split the system RAM into N fake nodes.
     	 */
     	if (strchr(emu_cmdline, 'U')) {
-		nodemask_t physnode_mask = numa_nodes_parsed;
     		unsigned long n;
     		int nid = 0;
    @@ -465,9 +466,6 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
     	 */
     	max_emu_nid = setup_emu2phys_nid(&dfl_phys_nid);
    -	/* commit */
-	*numa_meminfo = ei;
-
     	/* Make sure numa_nodes_parsed only contains emulated nodes */
     	nodes_clear(numa_nodes_parsed);
     	for (i = 0; i < ARRAY_SIZE(ei.blk); i++)
@@ -475,10 +473,21 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
     		    ei.blk[i].nid != NUMA_NO_NODE)
     			node_set(ei.blk[i].nid, numa_nodes_parsed);
    -	numa_emu_update_cpu_to_node(emu_nid_to_phys, ARRAY_SIZE(emu_nid_to_phys));
+	/* fix pxm_to_node_map[] and node_to_pxm_map[] to avoid collision
+	 * with faked numa nodes, particularly during later memory hotplug
+	 * handling, and also update numa_nodes_parsed accordingly.
+	 */
+	ret = fix_pxm_node_maps(max_emu_nid);
+	if (ret < 0)
+		goto no_emu;
+
+	/* commit */
+	*numa_meminfo = ei;
+
+	numa_emu_update_cpu_to_node(emu_nid_to_phys, max_emu_nid + 1);
      	/* make sure all emulated nodes are mapped to a physical node */
-	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
+	for (i = 0; i < max_emu_nid + 1; i++)
     		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
     			emu_nid_to_phys[i] = dfl_phys_nid;
    @@ -501,12 +510,34 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
     			numa_set_distance(i, j, dist);
     		}
     	}
+	for (i = 0; i < numa_distance_cnt; i++) {
+		for (j = 0; j < numa_distance_cnt; j++) {
+			int physi, physj;
+			u8 dist;
+
+			/* distance between fake nodes is already ok */
+			if (emu_nid_to_phys[i] != NUMA_NO_NODE &&
+			    emu_nid_to_phys[j] != NUMA_NO_NODE)
+				continue;
+			if (emu_nid_to_phys[i] != NUMA_NO_NODE)
+				physi = emu_nid_to_phys[i];
+			else
+				physi = i - max_emu_nid;
+			if (emu_nid_to_phys[j] != NUMA_NO_NODE)
+				physj = emu_nid_to_phys[j];
+			else
+				physj = j - max_emu_nid;
+			dist = phys_dist[physi * numa_dist_cnt + physj];
+			numa_set_distance(i, j, dist);
+		}
+	}
      	/* free the copied physical distance table */
     	memblock_free(phys_dist, phys_size);
     	return;
      no_emu:
+	numa_nodes_parsed = physnode_mask;
     	/* No emulation.  Build identity emu_nid_to_phys[] for numa_add_cpu() */
     	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
     		emu_nid_to_phys[i] = i;
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index a3877e9bc878..ff4054f4334d 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -7,7 +7,7 @@
     #include <linux/numa.h>
     #include <linux/numa_memblks.h>
    -static int numa_distance_cnt;
+int numa_distance_cnt;
     static u8 *numa_distance;
      nodemask_t numa_nodes_parsed __initdata;


Best Regards,
Yan, Zi














[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux