A brief introduction
====================

NUMA emulation can fake more nodes based on a single-node system, e.g.
a one-node system:

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31788 MB
node 0 free: 31446 MB
node distances:
node   0
  0:  10

With numa=fake=2 added (fake 2 nodes on each original node):

[root@localhost ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 15806 MB
node 0 free: 15451 MB
node 1 cpus: 0 1 2 3 4 5 6 7
node 1 size: 16029 MB
node 1 free: 15989 MB
node distances:
node   0   1
  0:  10  10
  1:  10  10

As shown above, a new node has been faked. As for CPUs, the behavior of
the x86 NUMA emulation is kept: every fake node sees all CPUs. Maybe it
would be better to give each node 4 dedicated cores (not sure; a next
step if so). A small userspace sketch for double-checking the emulated
topology is appended at the end of this letter.

Why do this
===========

There are the following reasons:

(1) On an x86 host, NUMA emulation can fake a multi-node environment
    for testing or verifying performance-related work, but arm64 has
    only one way to do this: modifying the ACPI table. That is more or
    less troublesome.

(2) It reduces contention on some locks. Here is an example we found:
    will-it-scale/tlb_flush1_processes -t 96 -s 10 shows an obvious
    hotspot on lruvec->lock when tested in a single-node environment.
    What's more, performance improves greatly when tested on a system
    with two or more nodes. The data is shown below (higher is better):

    ---------------------------------------------------------------------
    threads/process |    1    |    12    |    24    |    48    |    96
    ---------------------------------------------------------------------
    one node        | 14 1122 | 110 5372 | 111 2615 |  79 7084 |  72 4516
    ---------------------------------------------------------------------
    numa=fake=2     | 14 1168 | 144 4848 | 215 9070 | 157 0412 | 142 3968
    ---------------------------------------------------------------------
    For concurrency 12, there is no lruvec->lock hotspot. For 24, the
    one-node case has a 24% hotspot on lruvec->lock, but the two-node
    environment has none.
    ---------------------------------------------------------------------

    A rough sketch illustrating this per-node separation is also
    appended at the end of this letter.

As for the risks (e.g. NUMA balancing ...), they still need to be
discussed here.

Lastly, it does not seem like a good choice to implement x86 and the
other generic architectures separately. But doing so can indeed avoid
some adjustments to architecture-related APIs and alleviate future
maintenance. The previous RFC can be found at [1].

Any advice is welcome, thanks!

Change log
==========

RFC v1 -> v1
* add a new CONFIG_NUMA_FAKE option for generic architectures.
* keep the x86 implementation; realize NUMA emulation in drivers/base/
  for generic architectures, e.g. arm64.

[1] RFC v1:
https://patchwork.kernel.org/project/linux-arm-kernel/cover/20231012024842.99703-1-rongwei.wang@xxxxxxxxxxxxxxxxx/

Rongwei Wang (2):
  arch_numa: remove __init for early_cpu_to_node
  numa: introduce numa emulation for generic arch

 drivers/base/Kconfig          |   9 +
 drivers/base/Makefile         |   1 +
 drivers/base/arch_numa.c      |  32 +-
 drivers/base/numa_emulation.c | 909 ++++++++++++++++++++++++++++++++++
 drivers/base/numa_emulation.h |  41 ++
 include/asm-generic/numa.h    |   2 +-
 6 files changed, 992 insertions(+), 2 deletions(-)
 create mode 100644 drivers/base/numa_emulation.c
 create mode 100644 drivers/base/numa_emulation.h

--
2.32.0.3.gf3a3e56d6
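
Appended sketch 1: as a quick sanity check of the emulated topology
from userspace, here is a minimal sketch using libnuma. It is not part
of this series and assumes the libnuma development headers are
installed; build with "gcc -o nodes nodes.c -lnuma".

#include <numa.h>
#include <stdio.h>

int main(void)
{
	long long free_bytes;
	int nid;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available\n");
		return 1;
	}

	/* With numa=fake=2 on the one-node box above, this prints 2. */
	printf("configured nodes: %d\n", numa_num_configured_nodes());

	for (nid = 0; nid <= numa_max_node(); nid++) {
		long long size = numa_node_size64(nid, &free_bytes);

		printf("node %d: size %lld MB, free %lld MB\n",
		       nid, size >> 20, free_bytes >> 20);
	}
	return 0;
}

This reports the same topology as numactl -H, so it can confirm in a
script whether the fake nodes took effect after boot.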
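
Appended sketch 2: a rough illustration of the per-node separation
mentioned in (2). Each node carries its own lruvec (and lruvec->lock),
so workers that allocate from different nodes stress different locks.
This is a hypothetical toy, not the will-it-scale test case; it again
assumes libnuma, with error handling mostly elided.

#include <numa.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK (64UL << 20)	/* 64 MB per worker */

int main(void)
{
	int nid;

	if (numa_available() < 0)
		return 1;

	/*
	 * One worker per node: each worker's pages land on its own
	 * node's LRU lists, so LRU-side contention is split across
	 * the per-node lruvec->lock instances instead of piling up
	 * on a single node's lock.
	 */
	for (nid = 0; nid <= numa_max_node(); nid++) {
		if (fork() == 0) {
			char *buf = numa_alloc_onnode(CHUNK, nid);

			if (buf) {
				memset(buf, 0, CHUNK);
				numa_free(buf, CHUNK);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}

With numa=fake=2 this spawns two workers touching two distinct
lruvecs, which is the effect behind the trend in the table above.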