On 12/4/18 4:15 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
>> Basically, is sysfs the right place to even expose this much data?
>
> I definitely want to avoid the memoryX mistake. So I do not want to
> see one link directory per device. Taking my simple laptop as an
> example with 4 CPUs, a wifi device and 2 GPUs (the integrated one
> and a discrete one):
>
> link0: cpu0 cpu1 cpu2 cpu3
> link1: wifi (2 PCIe lanes)
> link2: gpu0 (unknown number of lanes, but I believe it has higher
>        bandwidth to main memory)
> link3: gpu1 (16 PCIe lanes)
> link4: gpu1 and GPU memory
>
> So there is one link directory per number of PCIe lanes your device
> has, so that you can differentiate on bandwidth. The main memory is
> symlinked inside all the link directories except link4. The GPU's
> discrete memory is only in the link4 directory, as it is only
> accessible by the GPU (we could add it under link3 too, with the
> non-cache-coherent property attached to it).

I'm actually really interested in how this proposal scales. It's
quite easy to represent a laptop, but can this scale to the largest
systems that we expect to encounter over the next 20 years that this
ABI will live?

> The issue then becomes how to convert down the over-verbose HMAT
> information to populate some reasonable layout for HMS. For that
> I would say: create a link directory for each distinct matrix cell.
> As an example, let's say that each entry in the matrix has bandwidth
> and latency; then we create a link directory for each combination of
> bandwidth and latency. On a simple system that should boil down to a
> handful of combinations, roughly speaking mirroring the example
> above of one link directory per number of PCIe lanes.

OK, but there are 1024*1024 matrix cells on a system with 1024
proximity domains (the ACPI term for a NUMA node). So it sounds like
you are proposing a million-directory approach.

We also can't simply say that two CPUs with the same connection to
two other CPUs (think of a 4-socket QPI-connected system) share the
same "link" just because they share the same combination of bandwidth
and latency. We need to know that *each* has its own, unique link and
that they do not share link resources.

> I don't think I have a system with an HMAT table. If you have an
> HMAT table to provide, I could show the end result.

It is new enough (ACPI 6.2) that no publicly-available hardware
exists that implements one (that I know of). Keith Busch can probably
extract one and send it to you, or show you how we're faking them
with QEMU.

> Note I believe the ACPI HMAT matrix is a bad design for these
> reasons: there is a lot of commonality across matrix entries, and
> many entries also do not make sense (ie an initiator not being able
> to access all the targets). I feel that link/bridge is much more
> compact and allows representing any directed graph, with multiple
> arrows from one node to the same other node.

I don't disagree. But folks are building systems with them, and we
need to either deal with it or make its data manageable. You saw our
approach: we cull the data and only expose the bare minimum in sysfs.
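
To make the "one link per distinct bandwidth/latency combination"
reduction discussed above concrete, here is a minimal, purely
illustrative userspace sketch (not the proposed kernel code): it
walks an HMAT-style initiator x target matrix and buckets each cell
into a link id keyed by its (bandwidth, latency) pair. The matrix
values, sizes, and names are made up for illustration.

/*
 * Toy sketch of collapsing an HMAT-style matrix into "links" keyed
 * by (bandwidth, latency).  All numbers and names are invented.
 */
#include <stdio.h>

#define NR_INITIATORS	4
#define NR_TARGETS	4
#define MAX_LINKS	(NR_INITIATORS * NR_TARGETS)

struct hmat_cell {
	unsigned int bandwidth;	/* MB/s */
	unsigned int latency;	/* ns */
};

struct link {
	unsigned int bandwidth;
	unsigned int latency;
};

static struct link links[MAX_LINKS];
static int nr_links;

/* Return the existing link with these properties, or create a new one. */
static int link_for(unsigned int bw, unsigned int lat)
{
	int i;

	for (i = 0; i < nr_links; i++)
		if (links[i].bandwidth == bw && links[i].latency == lat)
			return i;

	links[nr_links].bandwidth = bw;
	links[nr_links].latency = lat;
	return nr_links++;
}

int main(void)
{
	/*
	 * Fake matrix: initiators 0-2 share one class of connection to
	 * targets 0-2; target 3 (say, device memory) is reachable only
	 * from initiator 3.  A bandwidth of 0 marks "cannot access".
	 */
	static const struct hmat_cell matrix[NR_INITIATORS][NR_TARGETS] = {
		{ {100, 10}, {100, 10}, {100, 10}, {  0, 0} },
		{ {100, 10}, {100, 10}, {100, 10}, {  0, 0} },
		{ {100, 10}, {100, 10}, {100, 10}, {  0, 0} },
		{ { 50, 40}, { 50, 40}, { 50, 40}, {400, 5} },
	};
	int i, t;

	for (i = 0; i < NR_INITIATORS; i++) {
		for (t = 0; t < NR_TARGETS; t++) {
			const struct hmat_cell *c = &matrix[i][t];

			if (!c->bandwidth)	/* initiator cannot reach target */
				continue;
			printf("initiator%d -> target%d : link%d\n",
			       i, t, link_for(c->bandwidth, c->latency));
		}
	}
	printf("%d distinct links for %d matrix cells\n",
	       nr_links, NR_INITIATORS * NR_TARGETS);
	return 0;
}

On this toy input, 16 matrix cells collapse into 3 links, which is
the compaction being proposed.  It also shows the objection raised in
the reply: two physically distinct links that happen to have
identical bandwidth and latency (the 4-socket QPI case) would fall
into the same bucket, hiding the fact that they do not share link
resources.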