Re: NUMA vs Proximity Domains

(reposting because of HTML mail format... sorry)


On Sat, 26 Oct 2019 at 22:32, Olof Johansson <olof@xxxxxxxxx> wrote:
>
> On Sat, Oct 26, 2019 at 5:12 AM Francois Ozog <francois.ozog@xxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I'd like to share some past experience that may be relevant to the SDT
> > discussion.
> >
> > In the context of 10Gbps networking I started to work on memory
> > affinity back in 2005. At some point I observed a processor with 16
> > cores and 4 memory channels, organized internally on two
> > interconnected dual rings (8 cores + 2 memory channels on a single
> > dual ring).
> > If you assign memory on the wrong dual ring, you pay a performance
> > penalty of 30% or more. Interleaving at the various stages (socket,
> > channel, rank...) does not help, because we try to keep the hot data
> > set as small as possible (interleaving granules were 64MB or 128
> > bytes depending on the level, with decoder policies that could not
> > be changed despite being programmable).
>
> This is literally what the DT numa spec already describes, isn't it?
>
> https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/numa.txt
>
On a Xeon, even a five-year-old one, there can be two proximity
domains on a single socket. So if a NUMA node can represent that, the
text should be enhanced to capture that a single socket can have more
than one NUMA node, depending on the architecture.
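
The kernel already surfaces such sub-socket nodes to userspace. Here
is a minimal sketch using libnuma (file name and build line are mine;
the distance values you would see are machine-specific) that dumps the
node distance matrix. On such a part, the two nodes sharing a socket
report a shorter mutual distance than cross-socket pairs:

  /* nodedist.c -- print pairwise NUMA node distances as seen by Linux.
   * Build: gcc nodedist.c -o nodedist -lnuma
   */
  #include <stdio.h>
  #include <numa.h>

  int main(void)
  {
      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support on this kernel\n");
          return 1;
      }
      int max = numa_max_node();
      for (int a = 0; a <= max; a++)
          for (int b = 0; b <= max; b++)
              /* SLIT-style value: 10 = local, larger = farther away */
              printf("node %d -> node %d: %d\n",
                     a, b, numa_distance(a, b));
      return 0;
  }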

> Interleaving indeed counteracts any effort to describe the topology
> if you interleave between different entities.
>
> > Some "good" ACPI systems where properly reporting the distances
> > between the cores and the  memory channels, with visible increased
> > cost if you use wrong proximity domain. So advanced programmers were
> > able to leverage the topology at its best ;-)
> >
> > Some technologies appear to protect the L3 cache for certain VMs,
> > and with increasing sensitivity to latency and jitter I would guess
> > that capturing the right topology shall become (is becoming?) a
> > priority.
> >
> > Too bad, Linux NUMA policy completely masks the intra-socket
> > asymmetry. Taking into account HPC, CCIX and CXL, the different
> > memory hierarchies may need a far richer information set than just
> > the NUMA socket.
>
> There's no restriction on NUMA policy being bound only at the unit of
> a socket, you can choose to define domains as you see fit. The same
> challenges apply to some of the modern x86 platforms such as AMD's
> multi-die chips where some CPU chiplets have memory close to them and
> others don't.
>
The Documentation text loosely describes two cases, and each case is
bound to socket boundaries. Too bad then.
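
Granted, the policy API itself is not socket-bound. A hedged sketch
with libnuma (the node number is illustrative, not from a real
machine) that binds all further allocations of the calling process to
a single sub-socket node:

  /* membind.c -- bind future allocations of this process to NUMA
   * node 1 only, e.g. one half of a dual-ring socket.
   * Build: gcc membind.c -o membind -lnuma
   */
  #include <numa.h>

  int main(void)
  {
      if (numa_available() < 0)
          return 1;

      struct bitmask *bm = numa_allocate_nodemask();
      numa_bitmask_setbit(bm, 1);  /* node 1 = one sub-socket domain */
      numa_set_membind(bm);        /* strict bind for malloc() etc.  */
      numa_bitmask_free(bm);

      /* ... hot data allocated from here on comes from node 1 ... */
      return 0;
  }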
> > So here are some questions:
> > - is there exploitable topology information available in DT to
> > identify the cost of using certain memory ranges (or other
> > selectable resources) by a given core?
>
> Yes
>
> > - is the proximity model the best way to expose topology
> > information for latency/jitter-sensitive apps to consume? (Not
> > trying to get exact topology information, but rather "actionable
> > knowledge" that can be leveraged in a simple way by apps,
> > schedulers, or memory allocators.)
>
> Probably, unless you have specific examples indicating otherwise.
> Imaginary complexity is always the worst kind -- "what if" designs
> that get overengineered and are never needed in reality.
>
I was just opening a discussion on whether things like Gen-Z or other
technologies were introducing new concepts to capture.
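
To make "actionable knowledge" concrete, a sketch (libnuma again; the
buffer size and file name are mine) that places a hot buffer on the
node of the core that will touch it:

  /* localbuf.c -- allocate a hot buffer on the node of the running core.
   * Build: gcc localbuf.c -o localbuf -lnuma
   */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <numa.h>

  int main(void)
  {
      if (numa_available() < 0)
          return 1;

      int cpu  = sched_getcpu();         /* core we are running on */
      int node = numa_node_of_cpu(cpu);  /* its proximity domain   */

      size_t len = 1 << 20;              /* 1 MiB, arbitrary       */
      void *buf = numa_alloc_onnode(len, node);
      if (!buf)
          return 1;

      printf("cpu %d -> node %d, buffer %p\n", cpu, node, buf);
      numa_free(buf, len);
      return 0;
  }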
> > - how hard is introducing proximity domains, or whatever actionable
> > knowledge we identify, into Linux? I don't mean replacing NUMA
> > information, as it is good enough in a number of cases, but rather
> > introducing an additional level of information.
>
> It's already there.
>
>
> -Olof



-- 
François-Frédéric Ozog | Director Linaro Edge & Fog Computing Group
T: +33.67221.6485
francois.ozog@xxxxxxxxxx | Skype: ffozog



