On 03.07.20 12:59, Michal Hocko wrote:
> On Fri 03-07-20 11:24:17, Michal Hocko wrote:
>> [Cc Andi]
>>
>> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
>>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
>>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
>> [...]
>>>>> Yep, looks like it.
>>>>>
>>>>> [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
>>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
>>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
>>>>> [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
>>>>> [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
>>>>> [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
>>>>> [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
>>>>
>>>> This begs the question whether ppc can do the same thing?
>>> Or x86 could stop doing it, so that you can see which node you are
>>> running on?
>>>
>>> What's the point of this indirection, other than another way of
>>> avoiding an empty node 0?
>>
>> Honestly, I do not have any idea. I've traced it down to
>>
>> Author: Andi Kleen <ak@xxxxxxx>
>> Date:   Tue Jan 11 15:35:48 2005 -0800
>>
>>     [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
>>
>>     Fix fallout from the recent nodemask_t changes. The node ids assigned
>>     in the SRAT parser were off by one.
>>
>>     I added a new first_unset_node() function to nodemask.h to allocate
>>     IDs sanely.
>>
>>     Signed-off-by: Andi Kleen <ak@xxxxxxx>
>>     Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>
>>
>> which doesn't really tell all that much. Historical baggage and
>> long-term behavior which is not really trivial to fix, I suspect.
>
> Thinking about this some more, this logic makes some sense after all,
> especially in a world without memory hotplug, which was very likely the
> case back then. It is much better to have a compact node mask than a
> sparse one. After all, node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except in the kernel ring buffer), though.
>
> Memory hotplug changes that somewhat, because you can hot-remove NUMA
> nodes and thereby make the nodemask sparse, but that is not a common
> case. I am not sure what would happen if a completely new node were
> added and its node id was already in use by a renumbered one, though.
> It would likely conflate the two, I am afraid. But I am not sure this
> is really possible with x86, and the lack of a bug report would suggest
> that nobody is doing that, at least.

I think the ACPI code takes care of properly mapping PXM to nodes. So if
I start with PXM 0 empty and PXM 1 populated, I will get PXM 1 == node 0,
as described.
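The compaction itself happens in acpi_map_pxm_to_node(): every PXM that
shows up gets the lowest node id that has not been handed out yet. From
memory, it looks roughly like this (paraphrased sketch, not the exact
upstream code):

int acpi_map_pxm_to_node(int pxm)
{
	int node;

	if (pxm < 0 || pxm >= MAX_PXM_DOMAINS || numa_off)
		return NUMA_NO_NODE;

	node = pxm_to_node_map[pxm];
	if (node == NUMA_NO_NODE) {
		/*
		 * First time we see this PXM: hand out the lowest free
		 * node id, so node ids stay compact no matter which PXM
		 * numbers the firmware uses.
		 */
		if (nodes_weight(nodes_found_map) >= MAX_NUMNODES)
			return NUMA_NO_NODE;
		node = first_unset_node(nodes_found_map);
		__acpi_map_pxm_to_node(pxm, node);
		node_set(node, nodes_found_map);
	}

	return node;
}

That is why an empty PXM 0 simply never gets a node id at boot, and
PXM 1 ends up as node 0.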
Once I hotplug something to PXM 0 in QEMU:

$ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
$ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor
$ echo "info numa" | sudo nc -U /var/tmp/monitor
QEMU 5.0.50 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus:
node 0 size: 1024 MB
node 0 plugged: 1024 MB
node 1 cpus: 0 1 2 3
node 1 size: 4096 MB
node 1 plugged: 0 MB

I get in the guest:

[   50.174435] ------------[ cut here ]------------
[   50.175436] node 1 was absent from the node_possible_map
[   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290
[   50.176844] Modules linked in:
[   50.176845] CPU: 0 PID: 7 Comm: kworker/u8:0 Not tainted 5.8.0-rc2+ #4
[   50.176846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   50.176846] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   50.176847] RIP: 0010:add_memory_resource+0x8c/0x290
[   50.176849] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 63 c5 48 89 04 24 48 0f a3 05 94 6c 1c 01 72 17 89 ee 48 c78
[   50.176849] RSP: 0018:ffffa7a1c0043d48 EFLAGS: 00010296
[   50.176850] RAX: 000000000000002c RBX: ffff8bc633e63b80 RCX: 0000000000000000
[   50.176851] RDX: ffff8bc63bc27060 RSI: ffff8bc63bc18d00 RDI: ffff8bc63bc18d00
[   50.176851] RBP: 0000000000000001 R08: 00000000000001e1 R09: ffffa7a1c0043bd8
[   50.176852] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000140000000
[   50.176852] R13: 000000017fffffff R14: 0000000040000000 R15: 0000000180000000
[   50.176853] FS:  0000000000000000(0000) GS:ffff8bc63bc00000(0000) knlGS:0000000000000000
[   50.176853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.176855] CR2: 000055dfcbfc5ee8 CR3: 00000000aca0a000 CR4: 00000000000006f0
[   50.176855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   50.176856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   50.176856] Call Trace:
[   50.176856]  __add_memory+0x33/0x70
[   50.176857]  acpi_memory_device_add+0x132/0x2f2
[   50.176857]  acpi_bus_attach+0xd2/0x200
[   50.176858]  acpi_bus_scan+0x33/0x70
[   50.176858]  acpi_device_hotplug+0x298/0x390
[   50.176858]  acpi_hotplug_work_fn+0x3d/0x50
[   50.176859]  process_one_work+0x1b4/0x370
[   50.176859]  worker_thread+0x53/0x3e0
[   50.176860]  ? process_one_work+0x370/0x370
[   50.176860]  kthread+0x119/0x140
[   50.176860]  ? __kthread_bind_mask+0x60/0x60
[   50.176861]  ret_from_fork+0x22/0x30
[   50.176861] ---[ end trace 9a2a837c1e0164f1 ]---
[   50.209816] acpi PNP0C80:00: add_memory failed
[   50.210510] acpi PNP0C80:00: acpi_memory_enable_device() error
[   50.211445] acpi PNP0C80:00: Enumeration failure

I remember that we added that check just recently (due to powerpc, if I
am not wrong). Not sure why it triggers here. But it properly maps
PXM 0 to node 1.
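The warning comes from the node_possible_map sanity check at the top of
add_memory_resource(). From memory, it is roughly this (paraphrased
sketch, might not match the exact 5.8-rc2 code):

	/* mm/memory_hotplug.c, early in add_memory_resource() */
	if (!node_possible(nid)) {
		WARN(1, "node %d was absent from the node_possible_map\n",
		     nid);
		return -EINVAL;
	}

Maybe node 1 never made it into node_possible_map because PXM 0 had no
SRAT entries at boot, so the id handed out for it on hotplug was never
marked possible?

-- 
Thanks,

David / dhildenb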