On 12/4/18 4:15 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
>> Basically, is sysfs the right place to even expose this much data?
>
> I definitely want to avoid the memoryX mistake. So I do not want to
> see one link directory per device. Taking my simple laptop as an
> example with 4 CPUs, a wifi device and 2 GPUs (the integrated one
> and a discrete one):
>
> link0: cpu0 cpu1 cpu2 cpu3
> link1: wifi (2 PCIe lanes)
> link2: gpu0 (unknown number of lanes, but I believe it has higher
>        bandwidth to main memory)
> link3: gpu1 (16 PCIe lanes)
> link4: gpu1 and GPU memory
>
> So there is one link directory per number of PCIe lanes your device
> has, so that you can differentiate on bandwidth. The main memory is
> symlinked inside all the link directories except link4. The GPU's
> discrete memory is only in the link4 directory, as it is only
> accessible by the GPU (we could add it under link3 too, with the
> non-cache-coherent property attached to it).

I'm actually really interested in how this proposal scales. It's
quite easy to represent a laptop, but can this scale to the largest
systems that we expect to encounter over the next 20 years that this
ABI will live?

> The issue then becomes how to convert down the over-verbose HMAT
> information to populate some reasonable layout for HMS. For that
> I would say: create a link directory for each distinct matrix cell.
> As an example, let's say that each entry in the matrix has bandwidth
> and latency; then we create a link directory for each combination of
> bandwidth and latency. On a simple system that should boil down to a
> handful of combinations, roughly speaking mirroring the example
> above of one link directory per number of PCIe lanes.

OK, but there are 1024*1024 matrix cells on a system with 1024
proximity domains (the ACPI term for a NUMA node). So it sounds like
you are proposing a million-directory approach.

We also can't simply say that two CPUs with the same connection to
two other CPUs (think of a 4-socket QPI-connected system) share the
same "link" just because they share the same combination of bandwidth
and latency. We need to know that *each* has its own, unique link and
that they do not share link resources.

> I don't think I have a system with an HMAT table. If you have an
> HMAT table to provide, I could show the end result.

It is new enough (ACPI 6.2) that no publicly-available hardware
exists that implements one (that I know of). Keith Busch can probably
extract one and send it to you, or show you how we're faking them
with QEMU.

> Note I believe the ACPI HMAT matrix is a bad design for these
> reasons: there is a lot of commonality across matrix entries, and
> many entries also do not make sense (ie an initiator not being able
> to access all the targets). I feel that link/bridge is much more
> compact and allows representing any directed graph, with multiple
> arrows from one node to the same other node.

I don't disagree. But folks are building systems with them, and we
need to either deal with it or make its data manageable. You saw our
approach: we cull the data and only expose the bare minimum in sysfs.
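
To make the "one link per distinct bandwidth/latency combination"
reduction discussed above concrete, here is a minimal, purely
illustrative userspace sketch (not the proposed kernel code): it
walks an HMAT-style initiator x target matrix and buckets each cell
into a link id keyed by its (bandwidth, latency) pair. The matrix
values, sizes, and names are made up for illustration.

/*
 * Toy sketch of collapsing an HMAT-style matrix into "links" keyed
 * by (bandwidth, latency).  All numbers and names are invented.
 */
#include <stdio.h>

#define NR_INITIATORS	4
#define NR_TARGETS	4
#define MAX_LINKS	(NR_INITIATORS * NR_TARGETS)

struct hmat_cell {
	unsigned int bandwidth;	/* MB/s */
	unsigned int latency;	/* ns */
};

struct link {
	unsigned int bandwidth;
	unsigned int latency;
};

static struct link links[MAX_LINKS];
static int nr_links;

/* Return the existing link with these properties, or create a new one. */
static int link_for(unsigned int bw, unsigned int lat)
{
	int i;

	for (i = 0; i < nr_links; i++)
		if (links[i].bandwidth == bw && links[i].latency == lat)
			return i;

	links[nr_links].bandwidth = bw;
	links[nr_links].latency = lat;
	return nr_links++;
}

int main(void)
{
	/*
	 * Fake matrix: initiators 0-2 share one class of connection to
	 * targets 0-2; target 3 (say, device memory) is reachable only
	 * from initiator 3.  A bandwidth of 0 marks "cannot access".
	 */
	static const struct hmat_cell matrix[NR_INITIATORS][NR_TARGETS] = {
		{ {100, 10}, {100, 10}, {100, 10}, {  0, 0} },
		{ {100, 10}, {100, 10}, {100, 10}, {  0, 0} },
		{ {100, 10}, {100, 10}, {100, 10}, {  0, 0} },
		{ { 50, 40}, { 50, 40}, { 50, 40}, {400, 5} },
	};
	int i, t;

	for (i = 0; i < NR_INITIATORS; i++) {
		for (t = 0; t < NR_TARGETS; t++) {
			const struct hmat_cell *c = &matrix[i][t];

			if (!c->bandwidth)	/* initiator cannot reach target */
				continue;
			printf("initiator%d -> target%d : link%d\n",
			       i, t, link_for(c->bandwidth, c->latency));
		}
	}
	printf("%d distinct links for %d matrix cells\n",
	       nr_links, NR_INITIATORS * NR_TARGETS);
	return 0;
}

On this toy input, 16 matrix cells collapse into 3 links, which is
the compaction being proposed.  It also shows the objection raised in
the reply: two physically distinct links that happen to have
identical bandwidth and latency (the 4-socket QPI case) would fall
into the same bucket, hiding the fact that they do not share link
resources.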