On Fri 03-07-20 13:32:21, David Hildenbrand wrote:
> On 03.07.20 12:59, Michal Hocko wrote:
> > On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> >> [Cc Andi]
> >>
> >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> >>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> >> [...]
> >>>>> Yep, looks like it.
> >>>>>
> >>>>> [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> >>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> >>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> >>>>> [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> >>>>> [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> >>>>> [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> >>>>> [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> >>>>
> >>>> This begs a question whether ppc can do the same thing?
> >>> Or x86 stop doing it so that you can see on what node you are running?
> >>>
> >>> What's the point of this indirection other than another way of avoiding
> >>> empty node 0?
> >>
> >> Honestly, I do not have any idea. I've traced it down to
> >> Author: Andi Kleen <ak@xxxxxxx>
> >> Date: Tue Jan 11 15:35:48 2005 -0800
> >>
> >> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> >>
> >> Fix fallout from the recent nodemask_t changes. The node ids assigned
> >> in the SRAT parser were off by one.
> >>
> >> I added a new first_unset_node() function to nodemask.h to allocate
> >> IDs sanely.
> >>
> >> Signed-off-by: Andi Kleen <ak@xxxxxxx>
> >> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>
> >>
> >> which doesn't really tell all that much. The historical baggage and a
> >> long term behavior which is not really trivial to fix I suspect.
> >
> > Thinking about this some more, this logic makes some sense after all.
> > Especially in the world without memory hotplug which was very likely the
> > case back then. It is much better to have compact node mask rather than
> > sparse one. After all node numbers shouldn't really matter as long as
> > you have a clear mapping to the HW. I am not sure we export that
> > information (except for the kernel ring buffer) though.
> >
> > The memory hotplug changes that somehow because you can hotremove numa
> > nodes and therefore make the nodemask sparse but that is not a common
> > case. I am not sure what would happen if a completely new node was added
> > and its corresponding node was already used by the renumbered one
> > though. It would likely conflate the two I am afraid. But I am not sure
> > this is really possible with x86 and a lack of a bug report would
> > suggest that nobody is doing that at least.
> >
>
> I think the ACPI code takes care of properly mapping PXM to nodes.
>
> So if I start with PXM 0 empty and PXM 1 populated, I will get
> PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU
>
> $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
> $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor
>
> $ echo "info numa" | sudo nc -U /var/tmp/monitor
> QEMU 5.0.50 monitor - type 'help' for more information
> (qemu) info numa
> 2 nodes
> node 0 cpus:
> node 0 size: 1024 MB
> node 0 plugged: 1024 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 4096 MB
> node 1 plugged: 0 MB

Thanks for double checking.

> I get in the guest:
>
> [ 50.174435] ------------[ cut here ]------------
> [ 50.175436] node 1 was absent from the node_possible_map
> [ 50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290

This would mean that the ACPI code or whoever does the remapping is not
adding the new node into possible nodes.

[...]
> I remember that we added that check just recently (due to powerpc if I am not wrong).
> Not sure why that triggers here.

This was a misbehaving QEMU IIRC providing a garbage map.

-- 
Michal Hocko
SUSE Labs
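
The renumbering discussed in this thread can be illustrated with a toy model. The sketch below is not the kernel code: it only assumes that node IDs are handed out in order of first use, in the spirit of the first_unset_node() helper named in the quoted commit message, and that the set of possible nodes is frozen at boot. The names PxmMap, node_for_pxm and possible_at_boot are hypothetical and exist only for this illustration.

```python
class PxmMap:
    """Toy model of compact PXM -> node renumbering (not the kernel code)."""

    def __init__(self, max_nodes=8):
        self.max_nodes = max_nodes
        self.pxm_to_node = {}      # firmware proximity domain -> node id
        self.nodes_in_use = set()  # node ids already handed out

    def first_unset_node(self):
        # Lowest node id not yet in use; this keeps the node mask compact.
        for nid in range(self.max_nodes):
            if nid not in self.nodes_in_use:
                return nid
        raise RuntimeError("no free node id")

    def node_for_pxm(self, pxm):
        # Allocate a node id the first time a PXM is seen, then reuse it.
        if pxm not in self.pxm_to_node:
            nid = self.first_unset_node()
            self.pxm_to_node[pxm] = nid
            self.nodes_in_use.add(nid)
        return self.pxm_to_node[pxm]


m = PxmMap()
boot_node = m.node_for_pxm(1)            # only PXM 1 is populated at boot
possible_at_boot = set(m.nodes_in_use)   # assumption: possible map frozen here
hotplug_node = m.node_for_pxm(0)         # memory later hotplugged into PXM 0
print(boot_node, hotplug_node, hotplug_node in possible_at_boot)
```

Under these assumptions PXM 1 becomes node 0, the hotplugged PXM 0 becomes node 1, and node 1 is missing from the set recorded at boot, which is the shape of the warning quoted above. Whether the real possible map is populated this way is exactly the open question in the thread.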