Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx> writes: > * Oscar Salvador <osalvador@xxxxxxx> [2022-04-06 18:19:00]: > >> On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote: >> > arch/powerpc/mm/numa.c | 2 +- >> > 1 file changed, 1 insertion(+), 1 deletion(-) >> > >> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c >> > index b9b7fefbb64b..13022d734951 100644 >> > --- a/arch/powerpc/mm/numa.c >> > +++ b/arch/powerpc/mm/numa.c >> > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu) >> > if (new_nid < 0 || !node_possible(new_nid)) >> > new_nid = first_online_node; >> > >> > - if (NODE_DATA(new_nid) == NULL) { >> > + if (!node_online(new_nid)) { >> > #ifdef CONFIG_MEMORY_HOTPLUG >> > /* >> > * Need to ensure that NODE_DATA is initialized for a node from >> >> Because of this fix, I wanted to check whether we might have any more fallouts due >> to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer >> as to why powerpc is the only platform that special cases try_online_node(), >> while all others rely on cpu_up()->try_online_node() to do the right thing. >> >> So, I had a look. >> Let us rewind a bit: >> >> The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was >> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes"). >> >> In there, it says that we have the following picture: >> >> partition_sched_domains >> arch_update_cpu_topology >> numa_update_cpu_topology >> find_and_online_cpu_nid >> >> and by the time find_and_online_cpu_nid() gets called to online the node, it might be >> too late as we might have referenced some NODE_DATA() fields. >> Note that this happens at a much later stage in cpuhp. >> >> Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(), >> which should allocate-and-online the node and prevent accessing garbage. >> But the problem is that, on powerpc, all possible cpus have the same node set at boot stage >> (see arch/powerpc/mm/numa.c:mem_topology_setup), >> so cpu_to_node() returns the same thing until it the mapping gets update (which happens in >> start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not >> apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu. >> >> (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node() >> tries to online node0, or whatever old mapping numa<->cpu is there). >> >> So, we have something like: >> >> dlpar_online_cpu >> device_online >> dev->bus->online >> cpu_subsys_online >> cpu_device_up >> cpu_up >> try_online_node (old mapping nid <-> cpu ) >> ... >> ... >> cphp_callbacks >> sched_cpu_activate >> cpuset_update_active_cpus >> schedule_work(&cpuset_hotplug_work) >> cpuset_hotplug_work >> partition_sched_domains >> arch_update_cpu_topology >> numa_update_cpu_topology >> find_and_online_cpu_nid (online new_nid) >> >> >> So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid() >> happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu() >> fixed the issue, but we should not need this special casing at all. >> >> We do already know the numa<->cpu associativity in >> dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to >> update the numa<->cpu mapping, and let the try_online_node() do the right thing. >> >> With this in mind, I came up with the following patch, where I carried out a battery >> of tests for CPU hotplug stuff and it worked as expected. >> But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so >> I would like to hear from more familiar people. >> >> The patch is: >> >> From: Oscar Salvador <osalvador@xxxxxxx> >> Date: Wed, 6 Apr 2022 14:39:15 +0200 >> Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier >> >> powerpc is the only platform that do not rely on >> cpu_up()->try_online_node() to bring up a numa node, >> and special cases it, instead, deep in its own machinery: >> >> dlpar_online_cpu >> find_and_online_cpu_nid >> try_online_node >> >> This should not be needed, but the thing is that the try_online_node() >> from cpu_up() will not apply on the right node, because cpu_to_node() >> will return the old mapping numa<->cpu that gets set on boot stage >> for all possible cpus. >> >> That can be seen easily if we try to print out the numa node passed >> to try_online_node() in cpu_up(). >> >> The thing is that the numa<->cpu mapping does not get updated till a much >> later stage in start_secondary: >> >> start_secondary: >> set_numa_node(numa_cpu_lookup_table[cpu]) >> >> But we do not really care, as we already now the >> CPU <-> NUMA associativity back in find_and_online_cpu_nid(), >> so let us make use of that and set the proper numa<->cpu mapping, >> so cpu_to_node() in cpu_up() returns the right node and >> try_online_node() can do its work. >> >> Signed-off-by: Oscar Salvador <osalvador@xxxxxxx> > > Looks good to me. > > Reviewed-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx> Yeah agreed, thanks for getting to the root of the problem. Can you resend as a standalone patch. Because you sent it as a reply it won't be recognised by patchwork[1] which means it risks getting lost. cheers 1: http://patchwork.ozlabs.org/project/linuxppc-dev/list/