(reposting because of HTML mail format... sorry)

On Sat, 26 Oct 2019 at 22:32, Olof Johansson <olof@xxxxxxxxx> wrote:
>
> On Sat, Oct 26, 2019 at 5:12 AM Francois Ozog <francois.ozog@xxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I'd like to share some past experience that may be relevant to the SDT
> > discussion.
> >
> > In the context of 10Gbps networking I started to work on memory
> > affinity back in 2005. At some point I observed a processor with 16
> > cores and 4 memory channels, organized internally as two
> > interconnected dual rings (8 cores + 2 memory channels per dual ring).
> > If you allocate memory on the wrong dual ring, you pay a 30% or more
> > performance penalty. Interleaving at the various stages (socket, channel,
> > rank...) does not help, because we try to keep the hot data set as
> > small as possible (the interleaving granules were 64MB or 128 bytes
> > depending on the level, and the selected decoder policies could not be
> > changed even though they were programmable).
>
> This is literally what the DT numa spec already describes, isn't it?
>
> https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/numa.txt
>
On a Xeon, even a five-year-old one, there can be two proximity domains
on a single socket. So if a NUMA node can represent that, then the text
should be enhanced to state explicitly that a single socket can contain
more than one NUMA node, depending on the architecture (see the sketch
further down).

> Interleaving indeed counteracts any efforts on describing topology if
> you interleave between different entities.
>
> > Some "good" ACPI systems were properly reporting the distances
> > between the cores and the memory channels, with a visibly increased
> > cost if you use the wrong proximity domain. So advanced programmers were
> > able to leverage the topology at its best ;-)
> >
> > Some technologies appear to protect the L3 cache for certain VMs, and with
> > more sensitivity to latency and jitter I would guess that capturing
> > the right topology shall become (is becoming?) a priority.
> >
> > Too bad, Linux NUMA policy completely masks the intra-socket
> > asymmetry. Taking into account HPC, CCIX and CXL, the different memory
> > hierarchies may need a far richer information set than just the NUMA
> > socket.
>
> There's no restriction on NUMA policy being bound only at the unit of
> a socket, you can choose to define domains as you see fit. The same
> challenges apply to some of the modern x86 platforms such as AMD's
> multi-die chips where some CPU chiplets have memory close to them and
> others don't.
>
The Documentation text loosely describes two cases, and each case is
bound to socket limits. Too bad then.

> > So here are some questions:
> > - is there exploitable topology information available in DT to
> > identify the cost of using certain memory ranges (or other selectable
> > resources) by a core?
>
> Yes
>
> > - is the proximity model the best way to expose the topology
> > information for latency/jitter apps to consume? (not trying to get
> > exact topology information but rather "actionable knowledge" that can
> > be leveraged in a simple way by apps or schedulers or memory
> > allocators).
>
> Probably, unless you have specific examples indicating otherwise.
> Imaginary complexity is always the worst kind -- "what if" designs
> that get overengineered and never needed in reality.
>
I was just opening the discussion on whether things like Gen-Z or other
technologies introduce new concepts that need to be captured.
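To make the single-socket case above concrete, here is a minimal sketch
using the existing numa-node-id / numa-distance-map-v1 binding, with one
socket exposed as two NUMA nodes. CPU ids, memory ranges and distance
values are purely illustrative:

    cpus {
        #address-cells = <2>;
        #size-cells = <0>;

        /* first intra-socket "half": node 0 */
        cpu@0 {
            device_type = "cpu";
            compatible = "arm,armv8";
            reg = <0x0 0x0>;
            numa-node-id = <0>;
        };

        /* second intra-socket "half": node 1, same physical socket */
        cpu@100 {
            device_type = "cpu";
            compatible = "arm,armv8";
            reg = <0x0 0x100>;
            numa-node-id = <1>;
        };
    };

    /* one memory range close to each half */
    memory@80000000 {
        device_type = "memory";
        reg = <0x0 0x80000000 0x0 0x80000000>;
        numa-node-id = <0>;
    };

    memory@100000000 {
        device_type = "memory";
        reg = <0x1 0x0 0x0 0x80000000>;
        numa-node-id = <1>;
    };

    /* asymmetric access cost between the two on-die nodes */
    distance-map {
        compatible = "numa-distance-map-v1";
        distance-matrix = <0 0 10>,
                          <0 1 20>,
                          <1 0 20>,
                          <1 1 10>;
    };

If the binding already allows something like this, then what I am asking
for is mostly a clarification of the Documentation text, not a new
mechanism.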
> > - How hard is introducing proximity domains, or whatever actionable
> > knowledge we identify, in Linux? I don't mean replacing the NUMA
> > information, as it is good enough in a number of cases, but rather
> > introducing an additional level of information.
>
> It's already there.
>
>
> -Olof

--
François-Frédéric Ozog | Director Linaro Edge & Fog Computing Group
T: +33.67221.6485
francois.ozog@xxxxxxxxxx | Skype: ffozog