On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> > On 01.07.20 13:06, David Hildenbrand wrote:
> > > On 01.07.20 13:01, Srikar Dronamraju wrote:
> > >> * David Hildenbrand <david@xxxxxxxxxx> [2020-07-01 12:15:54]:
> > >>
> > >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> > >>>> * Michal Hocko <mhocko@xxxxxxxxxx> [2020-07-01 10:42:00]:
> > >>>>
> > >>>>>
> > >>>>>>
> > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> > >>>>>> number of online nodes is inconsistent with the information in the
> > >>>>>> device-tree and resource-dump
> > >>>>>>
> > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > >>>>>> the hit from the unnecessary numa hinting faults.
> > >>>>>
> > >>>>> I have to say that I dislike the node online/offline state and directly
> > >>>>> exporting that to the userspace. Users should only care whether the node
> > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > >>>>> offline all the present memory blocks but do not physically hot remove
> > >>>>> them and you are in the same situation. If users are confused by an
> > >>>>> output of tools like numactl -H then those could be updated and hide
> > >>>>> nodes without any memory&cpus.
> > >>>>>
> > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > >>>>> really solve the underlying problem because I strongly suspect that the
> > >>>>> problem is still there when a numa node gets all its memory offline as
> > >>>>> mentioned above.
>
> I would really appreciate a feedback to these two as well.
>
> > >>>>> While I completely agree that making node 0 special is wrong, I have
> > >>>>> still hard time to review this very simply looking patch because all the
> > >>>>> numa initialization is so spread around that this might just blow up
> > >>>>> at unexpected places. IIRC we have discussed testing in the previous
> > >>>>> version and David has provided a way to emulate these configurations
> > >>>>> on x86. Did you manage to use those instruction for additional testing
> > >>>>> on other than ppc architectures?
> > >>>>>
> > >>>>
> > >>>> I have tried all the steps that David mentioned and reported back at
> > >>>> https://lore.kernel.org/lkml/20200511174731.GD1961@xxxxxxxxxxxxxxxxxx/t/#u
> > >>>>
> > >>>> As a summary, David's steps are still not creating a memoryless/cpuless on
> > >>>> x86 VM.
> > >>>
> > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> > >>> online*. Once you hotplug some memory, it will switch online. Once you
> > >>> remove memory, it will switch back offline.
> > >>>
> > >>
> > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> > >> boot. The code in question tries to handle a cpuless/memoryless node 0 at
> > >> boot.
> > >
> > > I was just correcting your statement, because it was wrong.
> > >
> > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > > have CPUs nor memory. That would imply that we can, in fact, never have
> > > node 0 offline during boot.
> > >
> >
> >
> > Yep, looks like it.
> >
> > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
>
> This begs a question whether ppc can do the same thing?

Or x86 stop doing it so that you can see on what node you are running?

What's the point of this indirection other than another way of avoiding
empty node 0?

Thanks

Michal