On Sat, Oct 26, 2019 at 5:12 AM Francois Ozog <francois.ozog@xxxxxxxxxx> wrote:
>
> Hi,
>
> I'd like to share some past experience that may be relevant to the SDT
> discussion.
>
> In the context of 10Gbps networking I started to work on memory
> affinity back in 2005. At some point I observed a processor with 16
> cores and 4 memory channels, organized internally as two
> interconnected dual rings (8 cores + 2 memory channels per dual ring).
> If you assign memory on the wrong dual ring, you pay a 30% or greater
> performance penalty. Interleaving at the various stages (socket,
> channel, rank...) does not help, because we try to keep the hot data
> set as small as possible (the interleaving granules were 64MB or 128
> bytes depending on the level and the selected decoder policies, which
> could not be changed despite being programmable).

This is literally what the DT NUMA spec already describes, isn't it?

https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/numa.txt

Interleaving indeed counteracts any effort to describe the topology if
you interleave between different entities.

> Some "good" ACPI systems were properly reporting the distances
> between the cores and the memory channels, with a visibly higher cost
> when the wrong proximity domain was used. So advanced programmers were
> able to leverage the topology at its best ;-)
>
> Some technologies appear to protect L3 cache for certain VMs, and with
> growing sensitivity to latency and jitter I would guess that capturing
> the right topology will become (is becoming?) a priority.
>
> Too bad that Linux NUMA policy completely masks the intra-socket
> asymmetry. Taking into account HPC, CCIX and CXL, the different memory
> hierarchies may need a much richer information set than just the NUMA
> socket.

There's no restriction on NUMA policy being bound only at the unit of a
socket; you can choose to define domains as you see fit (see the
allocation sketch appended at the end of this message). The same
challenges apply to some modern x86 platforms, such as AMD's multi-die
chips, where some CPU chiplets have memory close to them and others
don't.

> So here are some questions:
> - is there exploitable topology information available in DT to
> identify the cost of using certain memory ranges (or other selectable
> resources) by a core?

Yes (a small sketch of how an application can consume that information
is appended at the end of this message).

> - is the proximity model the best way to expose the topology
> information for latency/jitter apps to consume? (not trying to get
> exact topology information, but rather "actionable knowledge" that can
> be leveraged in a simple way by apps, schedulers or memory
> allocators).

Probably, unless you have specific examples indicating otherwise.
Imaginary complexity is always the worst kind -- "what if" designs that
get overengineered and are never needed in reality.

> - How hard would it be to introduce proximity domains, or whatever
> actionable knowledge we identify, in Linux? I don't mean replacing the
> NUMA information, as it is good enough in a number of cases, but
> rather introducing an additional level of information.

It's already there.


-Olof
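
As a rough, hypothetical illustration of how the topology described by
the numa.txt binding linked above becomes "actionable" for an
application: once the kernel has parsed the DT numa-node-id and
distance-map properties (or an ACPI SLIT), userspace can read the
resulting CPU-to-node map and node distances through libnuma. This is a
minimal sketch, assuming the libnuma headers are installed and linking
with -lnuma:

    /*
     * Illustrative only: print the CPU-to-node map and the node
     * distance matrix that the kernel exposes (populated from the DT
     * distance-map or an ACPI SLIT).
     */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        int max_node = numa_max_node();
        int ncpus = numa_num_configured_cpus();

        for (int cpu = 0; cpu < ncpus; cpu++)
            printf("cpu%d -> node%d\n", cpu, numa_node_of_cpu(cpu));

        /* SLIT-style relative cost: 10 = local, larger = further away. */
        for (int a = 0; a <= max_node; a++)
            for (int b = 0; b <= max_node; b++)
                printf("distance(node%d, node%d) = %d\n",
                       a, b, numa_distance(a, b));

        return 0;
    }

numactl --hardware prints the same distance matrix, so the information
is also reachable without writing any code.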
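
On the placement side, here is a minimal, hypothetical sketch of
pinning a hot data set to one chosen node, at whatever granularity the
platform reports its nodes (socket, die, chiplet, or a group of memory
channels); again illustrative only, using libnuma:

    /*
     * Illustrative only: bind a hot data set (and the current thread)
     * to one specific NUMA node.
     */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        int node = 0;                  /* the node "close" to this workload */
        size_t len = 64 * 1024 * 1024; /* e.g. a 64MB hot data set */

        /* Back the buffer with memory from the chosen node only. */
        void *buf = numa_alloc_onnode(len, node);
        if (!buf) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }
        memset(buf, 0, len);           /* touch it so pages are placed now */

        /* Keep the current thread on that node's CPUs as well. */
        numa_run_on_node(node);

        /* ... hot path works on buf ... */

        numa_free(buf, len);
        return 0;
    }

The same effect is available without code changes through numactl
--membind / --cpunodebind, and mbind(2) / set_mempolicy(2) offer
finer-grained control.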