On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jglisse@xxxxxxxxxx wrote:
> > This patchset use the above scheme to expose system topology through
> > sysfs under /sys/bus/hms/ with:
> >   - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >     each has a UID and you can usual value in that folder (node id,
> >     size, ...)
> >
> >   - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >     (CPU or device), each has a HMS UID but also a CPU id for CPU
> >     (which match CPU id in (/sys/bus/cpu/). For device you have a
> >     path that can be PCIE BUS ID for instance)
> >
> >   - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
> >     UID and a file per property (bandwidth, latency, ...) you also
> >     find a symlink to every target and initiator connected to that
> >     link.
> >
> >   - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >     a UID and a file per property (bandwidth, latency, ...) you
> >     also find a symlink to all initiators that can use that bridge.
>
> We support 1024 NUMA nodes on x86. The ACPI HMAT expresses the
> connections between each node. Let's suppose that each node has some
> CPUs and some memory.
>
> That means we'll have 1024 target directories in sysfs, 1024 initiator
> directories in sysfs, and 1024*1024 link directories. Or, would the
> kernel be responsible for "compiling" the firmware-provided information
> down into a more manageable number of links?
>
> Some idiot made the mistake of having one sysfs directory per 128MB of
> memory way back when, and now we have hundreds of thousands of
> /sys/devices/system/memory/memoryX directories. That sucks to manage.
> Isn't this potentially repeating that mistake?
>
> Basically, is sysfs the right place to even expose this much data?

I definitely want to avoid the memoryX mistake, so I do not want to see
one link directory per device. Taking my simple laptop as an example,
with 4 CPUs, a wifi device, and 2 GPUs (one integrated, one discrete):

link0: cpu0 cpu1 cpu2 cpu3
link1: wifi (2 PCIe lanes)
link2: gpu0 (unknown number of lanes, but I believe it has higher
       bandwidth to main memory)
link3: gpu1 (16 PCIe lanes)
link4: gpu1 and GPU memory

So there is one link directory per PCIe lane count, which lets you
differentiate on bandwidth. Main memory is symlinked inside every link
directory except link4. The discrete GPU memory is only in the link4
directory, as it is only accessible by the GPU (we could also add it
under link3 with a non-cache-coherent property attached to it).

The issue then becomes how to convert the over-verbose HMAT information
down into a reasonable layout for HMS. For that I would create one link
directory per distinct matrix cell. As an example, say each entry in
the matrix has a bandwidth and a latency; then we create one link
directory for each combination of bandwidth and latency (see the sketch
below). On a simple system that should boil down to a handful of
combinations, roughly mirroring the example above of one link directory
per PCIe lane count. I don't think I have a system with an HMAT table;
if you can provide one, I could show the end result.

Note that I believe the ACPI HMAT matrix is a bad design for these
reasons: there is a lot of commonality among the matrix entries, and
many entries also do not make sense (i.e. an initiator that is not able
to access all the targets).
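To make the matrix-cell de-duplication concrete, here is a minimal
userspace sketch (not the patchset code; the struct names hmat_cell and
hms_link are invented for illustration): it walks an initiator x target
matrix and assigns one link id per distinct (bandwidth, latency) pair,
so cells with identical properties collapse onto the same link
directory.

/*
 * Sketch only: collapse an HMAT-style initiator x target matrix into
 * one "link" per distinct (bandwidth, latency) pair. All names and
 * numbers below are made up for illustration.
 */
#include <stdio.h>

struct hmat_cell {
	unsigned int initiator;	/* initiator id (CPU or device) */
	unsigned int target;	/* target memory id */
	unsigned int bandwidth;	/* MB/s */
	unsigned int latency;	/* ns */
};

struct hms_link {
	unsigned int bandwidth;
	unsigned int latency;
};

#define MAX_LINKS 64

static struct hms_link links[MAX_LINKS];
static unsigned int nlinks;

/* Return the link id matching (bandwidth, latency), creating it if new. */
static int link_for_cell(const struct hmat_cell *cell)
{
	unsigned int i;

	for (i = 0; i < nlinks; i++) {
		if (links[i].bandwidth == cell->bandwidth &&
		    links[i].latency == cell->latency)
			return i;
	}
	if (nlinks >= MAX_LINKS)
		return -1;
	links[nlinks].bandwidth = cell->bandwidth;
	links[nlinks].latency = cell->latency;
	return nlinks++;
}

int main(void)
{
	/*
	 * Toy matrix: initiators 0-3 are CPUs, initiator 4 is a GPU;
	 * target 0 is main memory, target 1 is GPU memory. Many cells
	 * share the same properties, so they land on the same link.
	 */
	struct hmat_cell matrix[] = {
		{ 0, 0, 19200, 100 }, { 1, 0, 19200, 100 },
		{ 2, 0, 19200, 100 }, { 3, 0, 19200, 100 },
		{ 4, 0,  1600, 500 },	/* GPU -> main memory */
		{ 4, 1, 32000,  80 },	/* GPU -> its own memory */
	};
	unsigned int i;

	for (i = 0; i < sizeof(matrix) / sizeof(matrix[0]); i++)
		printf("initiator %u -> target %u : link%d\n",
		       matrix[i].initiator, matrix[i].target,
		       link_for_cell(&matrix[i]));
	return 0;
}

On the laptop example above this produces a handful of link directories
rather than one directory per initiator/target pair.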
I feel that link/bridge is much more compact and allows representing
any directed graph, including multiple arrows from one node to the same
other node.

Cheers,
Jérôme