On Sat, Oct 26, 2019 at 5:12 AM Francois Ozog <francois.ozog@xxxxxxxxxx> wrote:
>
> Hi,
>
> I'd like to share some past experience that may be relevant to the SDT
> discussion.
>
> In the context of 10Gbps networking I started to work on memory
> affinity back in 2005. At some point I observed a processor with 16
> cores and 4 memory channels, organized internally as two
> interconnected dual rings (8 cores + 2 memory channels per dual ring).
> If you assign memory on the wrong dual ring, you pay a 30% or greater
> performance penalty. Interleaving at the various stages (socket,
> channel, rank...) does not help, because we try to keep the hot data
> set as small as possible (the interleaving granules were 64MB or 128
> bytes depending on the level and the selected decoder policies, which
> could not be changed despite being programmable).

This is literally what the DT NUMA spec already describes, isn't it?

https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/numa.txt

Interleaving indeed counteracts any effort to describe the topology if
you interleave between different entities.

> Some "good" ACPI systems were properly reporting the distances
> between the cores and the memory channels, with a visibly higher cost
> when the wrong proximity domain was used. So advanced programmers were
> able to leverage the topology at its best ;-)
>
> Some technologies appear to protect L3 cache for certain VMs, and with
> growing sensitivity to latency and jitter I would guess that capturing
> the right topology will become (is becoming?) a priority.
>
> Too bad that Linux NUMA policy completely masks the intra-socket
> asymmetry. Taking into account HPC, CCIX and CXL, the different memory
> hierarchies may need a much richer information set than just the NUMA
> socket.

There's no restriction on NUMA policy being bound only at the unit of a
socket; you can choose to define domains as you see fit (see the
allocation sketch appended at the end of this message). The same
challenges apply to some modern x86 platforms, such as AMD's multi-die
chips, where some CPU chiplets have memory close to them and others
don't.

> So here are some questions:
> - is there exploitable topology information available in DT to
> identify the cost of using certain memory ranges (or other selectable
> resources) by a core?

Yes (a small sketch of how an application can consume that information
is appended at the end of this message).

> - is the proximity model the best way to expose the topology
> information for latency/jitter apps to consume? (not trying to get
> exact topology information, but rather "actionable knowledge" that can
> be leveraged in a simple way by apps, schedulers or memory
> allocators).

Probably, unless you have specific examples indicating otherwise.
Imaginary complexity is always the worst kind -- "what if" designs that
get overengineered and are never needed in reality.

> - How hard would it be to introduce proximity domains, or whatever
> actionable knowledge we identify, in Linux? I don't mean replacing the
> NUMA information, as it is good enough in a number of cases, but
> rather introducing an additional level of information.

It's already there.


-Olof
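
As a rough, hypothetical illustration of how the topology described by
the numa.txt binding linked above becomes "actionable" for an
application: once the kernel has parsed the DT numa-node-id and
distance-map properties (or an ACPI SLIT), userspace can read the
resulting CPU-to-node map and node distances through libnuma. This is a
minimal sketch, assuming the libnuma headers are installed and linking
with -lnuma:

    /*
     * Illustrative only: print the CPU-to-node map and the node
     * distance matrix that the kernel exposes (populated from the DT
     * distance-map or an ACPI SLIT).
     */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        int max_node = numa_max_node();
        int ncpus = numa_num_configured_cpus();

        for (int cpu = 0; cpu < ncpus; cpu++)
            printf("cpu%d -> node%d\n", cpu, numa_node_of_cpu(cpu));

        /* SLIT-style relative cost: 10 = local, larger = further away. */
        for (int a = 0; a <= max_node; a++)
            for (int b = 0; b <= max_node; b++)
                printf("distance(node%d, node%d) = %d\n",
                       a, b, numa_distance(a, b));

        return 0;
    }

numactl --hardware prints the same distance matrix, so the information
is also reachable without writing any code.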
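
On the placement side, here is a minimal, hypothetical sketch of
pinning a hot data set to one chosen node, at whatever granularity the
platform reports its nodes (socket, die, chiplet, or a group of memory
channels); again illustrative only, using libnuma:

    /*
     * Illustrative only: bind a hot data set (and the current thread)
     * to one specific NUMA node.
     */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        int node = 0;                  /* the node "close" to this workload */
        size_t len = 64 * 1024 * 1024; /* e.g. a 64MB hot data set */

        /* Back the buffer with memory from the chosen node only. */
        void *buf = numa_alloc_onnode(len, node);
        if (!buf) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }
        memset(buf, 0, len);           /* touch it so pages are placed now */

        /* Keep the current thread on that node's CPUs as well. */
        numa_run_on_node(node);

        /* ... hot path works on buf ... */

        numa_free(buf, len);
        return 0;
    }

The same effect is available without code changes through numactl
--membind / --cpunodebind, and mbind(2) / set_mempolicy(2) offer
finer-grained control.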