On Mon, Oct 28, 2019 at 6:00 AM Francois Ozog <francois.ozog@xxxxxxxxxx> wrote:
>
> (reposting because of HTML mail format... sorry)
>
>
> On Sat, 26 Oct 2019 at 22:32, Olof Johansson <olof@xxxxxxxxx> wrote:
> >
> > On Sat, Oct 26, 2019 at 5:12 AM Francois Ozog <francois.ozog@xxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > I'd like to share some past experience that may be relevant to the SDT
> > > discussion.
> > >
> > > In the context of 10Gbps networking I started to work on memory
> > > affinity back in 2005. At some point I observed a processor with 16
> > > cores and 4 memory channels, organized internally as two
> > > interconnected dual rings (8 cores + 2 memory channels on a single
> > > dual ring).
> > > If you assign memory on the wrong dual ring, you take a 30% or more
> > > performance penalty. Interleaving at the various stages (socket,
> > > channel, rank...) does not help because we try to keep the hot data
> > > set as small as possible (granules for interleaving were 64MB or 128
> > > bytes depending on the level, and the selected decoder policies could
> > > not be changed despite being programmable).
> >
> > This is literally what the DT NUMA spec already describes, isn't it?
> >
> > https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/numa.txt
> >
> On a Xeon, even a 5-year-old one, there can be 2 proximity domains on
> a single socket.
> So if a NUMA node can represent that, then the text should be enhanced
> to actually capture that a single socket can have more than one NUMA
> node, depending on the architecture.

The example says 2 sockets, but that is completely outside the scope of
the binding. IOW, a 16 core SoC with 2 domains would have exactly the
same binding.

> > Interleaving indeed counteracts any effort at describing topology if
> > you interleave between different entities.
> >
> > > Some "good" ACPI systems were properly reporting the distances
> > > between the cores and the memory channels, with a visibly increased
> > > cost if you used the wrong proximity domain. So advanced programmers
> > > were able to leverage the topology at its best ;-)
> > >
> > > Some technologies appear to protect the L3 cache for certain VMs,
> > > and with more sensitivity to latency and jitter I would guess that
> > > capturing the right topology shall become (is becoming?) a priority.
> > >
> > > Too bad, Linux NUMA policy completely masks the intra-socket
> > > asymmetry. Taking into account HPC, CCIX and CXL, the different
> > > memory hierarchies may need a far richer information set than just
> > > the NUMA socket.
> >
> > There's no restriction on NUMA policy being bound only at the unit of
> > a socket; you can choose to define domains as you see fit. The same
> > challenges apply to some of the modern x86 platforms, such as AMD's
> > multi-die chips, where some CPU chiplets have memory close to them and
> > others don't.
> >
> The Documentation text loosely describes two cases, and each case is
> bound to socket limits. Too bad then.

Sorry, I don't follow. Do you have an example you don't think is
covered by the binding?

Rob
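
[Editorial illustration, not part of the original thread: a minimal sketch of how the existing numa.txt binding can already describe a single-socket, 16-core part split into two proximity domains, as Rob argues above. The compatible strings, CPU numbering, addresses and distance values are assumed placeholders, not taken from any real platform.]

/ {
	#address-cells = <2>;
	#size-cells = <2>;

	cpus {
		#address-cells = <1>;
		#size-cells = <0>;

		/* CPUs on the first dual ring -> NUMA node 0 */
		cpu@0 {
			device_type = "cpu";
			compatible = "arm,cortex-a72";
			reg = <0x0>;
			numa-node-id = <0>;
		};
		/* ... cpu@1 through cpu@7, also numa-node-id = <0> ... */

		/* CPUs on the second dual ring -> NUMA node 1 */
		cpu@100 {
			device_type = "cpu";
			compatible = "arm,cortex-a72";
			reg = <0x100>;
			numa-node-id = <1>;
		};
		/* ... cpu@101 through cpu@107, also numa-node-id = <1> ... */
	};

	/* Memory behind the channels local to each ring */
	memory@80000000 {
		device_type = "memory";
		reg = <0x0 0x80000000 0x0 0x80000000>;
		numa-node-id = <0>;
	};

	memory@880000000 {
		device_type = "memory";
		reg = <0x8 0x80000000 0x0 0x80000000>;
		numa-node-id = <1>;
	};

	distance-map {
		compatible = "numa-distance-map-v1";
		/* local access = 10, cross-ring access penalized */
		distance-matrix = <0 0 10>,
				  <0 1 20>,
				  <1 0 20>,
				  <1 1 10>;
	};
};

Nothing in the sketch refers to a socket; the two numa-node-id values and the distance-map carry the intra-socket asymmetry, so a single-socket part with two domains and a two-socket part with one domain each look identical to the binding.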