NUMA vs Proximity Domains

Francois Ozog <francois.ozog@xxxxxxxxxx> · Sat, 26 Oct 2019 14:12:25 +0200

Hi,

I'd like to share some past experience that may be relevant to the SDT
(System Device Tree) discussion.

In the context of 10Gbps networking, I started working on memory
affinity back in 2005. At some point I came across a processor with 16
cores and 4 memory channels, organized internally as two
interconnected dual rings (8 cores and 2 memory channels per dual
ring).
If you allocate memory on the wrong dual ring, you pay a performance
penalty of 30% or more. Interleaving at various levels (socket,
channel, rank...) does not help, because we try to keep the hot data
set as small as possible (interleaving granules were 64MB or 128 bytes
depending on the level, and the selected decoder policies could not be
changed even though the decoders were programmable).

Some "good" ACPI systems were properly reporting the distances
between the cores and the memory channels, with a visibly increased
cost if you used the wrong proximity domain. So advanced programmers
were able to make the most of the topology ;-)
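To illustrate what "leveraging" such distances can look like: on
Linux, the firmware-reported (ACPI SLIT) distances show up per node in
/sys/devices/system/node/nodeN/distance and through libnuma's
numa_distance(). SLIT values are normalized so that a node's distance
to itself is 10, so a value of 13 reads as roughly 1.3x the local
cost, which is the kind of 30% penalty described above. The sketch
below is only an illustration under those assumptions: the helper
names are hypothetical, it works on one already-read row of the
distance table, and treating the distance as a linear cost is a
simplification, since firmware decides what the numbers mean.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill `order` with node ids 0..n-1 sorted by
 * ascending distance from the current node (one SLIT row). Insertion
 * sort keeps it stable, so on ties the lower node id comes first.
 * An allocator could try nodes in this order. Returns n. */
size_t nodes_by_proximity(const int *dist_row, size_t n, int *order)
{
    assert(dist_row != NULL && order != NULL);
    for (size_t i = 0; i < n; i++)
        order[i] = (int)i;
    for (size_t i = 1; i < n; i++) {
        int cur = order[i];
        size_t j = i;
        /* shift farther nodes right until cur's slot is found */
        while (j > 0 && dist_row[order[j - 1]] > dist_row[cur]) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = cur;
    }
    return n;
}

/* Hypothetical helper: rough extra cost (in percent) of a remote node,
 * assuming SLIT's convention that 10 means local and that the value
 * scales linearly (an assumption, not a guarantee). */
int slit_penalty_percent(int distance)
{
    return (distance - 10) * 10;
}
```

For example, with a made-up row {10, 20, 13, 23} for node 0, the
preference order is 0, 2, 1, 3, and node 2 carries an estimated 30%
penalty.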

Some technologies are appearing to protect the L3 cache for certain
VMs, and with growing sensitivity to latency and jitter, I would guess
that capturing the right topology will become (is becoming?) a
priority.

Unfortunately, Linux NUMA policy completely masks the intra-socket
asymmetry. Taking HPC, CCIX and CXL into account, the different memory
hierarchies may need a much richer set of information than just the
NUMA socket.

So here are some questions:
- Is there exploitable topology information available in DT to
identify the cost, for a given core, of using certain memory ranges
(or other selectable resources)?
- Is the proximity model the best way to expose topology information
for latency/jitter-sensitive apps to consume? (The goal is not exact
topology information but rather "actionable knowledge" that apps,
schedulers or memory allocators can leverage in a simple way.)
- How hard would it be to introduce proximity domains, or whatever
actionable knowledge we identify, in Linux? I don't mean replacing
NUMA information, as it is good enough in a number of cases, but
rather introducing an additional level of information.

Cordially,

FF

PS: Memory characteristics such as persistence are orthogonal to this
discussion. Persistence is complex in itself: there are NVDIMMs
(flash-backed DRAM) but also pure flash DIMMs (Diablo Technologies).
Other possible memory characteristics include compute-capable DIMMs
(say, a DIMM that implements a pattern-matching algorithm, with the
first 64MB operating as memory-mapped I/O).