Re: [PATCH v2] NUMA: Early use of cpu_to_node() returns 0 instead of the correct node id

Shijie Huang <shijie@xxxxxxxxxxxxxxxxxxxxxxxxxx> · Thu, 25 Jan 2024 17:15:29 +0800

On 2024/1/25 15:31, Mike Rapoport wrote:
On Wed, Jan 24, 2024 at 09:19:00AM -0800, Lameter, Christopher wrote:
On Tue, 23 Jan 2024, Huang Shijie wrote:

During kernel boot, the generic cpu_to_node() is called too early on
arm64, powerpc and riscv when CONFIG_NUMA is enabled.

For arm64/powerpc/riscv, there are at least four places in the common code
where the generic cpu_to_node() is called before it is initialized:
	   1.) early_trace_init()         in kernel/trace/trace.c
	   2.) sched_init()               in kernel/sched/core.c
	   3.) init_sched_fair_class()    in kernel/sched/fair.c
	   4.) workqueue_init_early()     in kernel/workqueue.c

To fix the bug, the patch changes the generic cpu_to_node() into a
function pointer and exports it for kernel modules.
Introduce smp_prepare_boot_cpu_start() to wrap the original
smp_prepare_boot_cpu(), and set cpu_to_node to early_cpu_to_node().
Introduce smp_prepare_cpus_done() to wrap the original smp_prepare_cpus(),
and set cpu_to_node to the formal _cpu_to_node().

Would you please fix this cleanly without a function pointer?

What I think needs to be done is a patch series.

1. Instrument cpu_to_node so that some warning is issued if it is used too
early. Preloading the array with NUMA_NO_NODE would allow us to do that.

2. Implement early_cpu_to_node on platforms that currently do not have it.

3. A series of patches that fix each place where cpu_to_node is used too
early.
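
Step 1 can be modelled in plain userspace C. The following is only a
sketch of the idea, not kernel code: the array numa_node[], the helpers
model_cpu_to_node() and model_set_cpu_numa_node(), the counter
early_use_warnings and the value of NR_CPUS are all made up for
illustration. Only the NUMA_NO_NODE sentinel (-1) comes from the kernel.

```c
#include <assert.h>
#include <stdio.h>

#define NR_CPUS      4      /* illustrative size, not a kernel config value */
#define NUMA_NO_NODE (-1)   /* same sentinel value the kernel uses */

/* Stand-in for the per-CPU numa_node variable, preloaded with the sentinel. */
static int numa_node[NR_CPUS] = {
	NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE
};

/* Hypothetical counter so that too-early lookups can be observed. */
static int early_use_warnings;

/* Model of an instrumented cpu_to_node(): warn if the slot is still unset. */
static int model_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (numa_node[cpu] == NUMA_NO_NODE) {
		early_use_warnings++;
		fprintf(stderr, "cpu_to_node(%d) used before it was set up\n", cpu);
	}
	return numa_node[cpu];
}

/* Model of the point where the real node id becomes known. */
static void model_set_cpu_numa_node(int cpu, int node)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	numa_node[cpu] = node;
}
```

With a zero-initialized array the early lookup would silently return
node 0; with the sentinel, each too-early caller becomes visible, which
is what makes step 3 tractable.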

For step 3, I find it hard to change cpu_to_node() to
early_cpu_to_node() for early_trace_init().

In early_trace_init(), __ring_buffer_alloc() calls cpu_to_node().

To fix the bug, __ring_buffer_alloc() would have to use
early_cpu_to_node().

But __ring_buffer_alloc() is also used by the kernel after boot has
finished.

After boot finishes, we should use cpu_to_node(), not
early_cpu_to_node().

I think step 3 can be simplified with a generic function that sets
per_cpu(numa_node) using early_cpu_to_node(). It can be called right after
setup_per_cpu_areas().
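
A rough userspace model of that suggestion, assuming a hypothetical
helper (called model_init_cpu_to_node() here) that runs once the per-CPU
areas exist. The early_map[] contents, NR_CPUS and all model_* names are
invented for illustration; the real per_cpu(numa_node) and
early_cpu_to_node() interfaces differ per architecture.

```c
#include <assert.h>

#define NR_CPUS      4      /* illustrative size, not a kernel config value */
#define NUMA_NO_NODE (-1)

/* Early map as arch code might fill it from firmware tables (made-up values). */
static const int early_map[NR_CPUS] = { 0, 0, 1, 1 };

/* Stand-in for per_cpu(numa_node, cpu); valid only once per-CPU areas exist. */
static int percpu_numa_node[NR_CPUS];
static int percpu_areas_ready;

static int model_early_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return early_map[cpu];
}

/* Model of setup_per_cpu_areas(): storage becomes usable, nodes still unset. */
static void model_setup_per_cpu_areas(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		percpu_numa_node[cpu] = NUMA_NO_NODE;
	percpu_areas_ready = 1;
}

/* Hypothetical generic helper: copy the early map into the per-CPU slots. */
static void model_init_cpu_to_node(void)
{
	assert(percpu_areas_ready);
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		percpu_numa_node[cpu] = model_early_cpu_to_node(cpu);
}

/* Model of the generic cpu_to_node(): a plain per-CPU read. */
static int model_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return percpu_numa_node[cpu];
}
```

If the helper runs right after the per-CPU areas are set up, later
callers such as early_trace_init() or sched_init() can keep calling
cpu_to_node() unchanged and still see the correct node.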

I think this method may be better.

I will try this too.

Thanks

Huang Shijie