Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation

Jerome Glisse <jglisse@xxxxxxxxxx> · Tue, 4 Dec 2018 15:41:53 -0500

On Tue, Dec 04, 2018 at 12:12:26PM -0800, Andi Kleen wrote:
> On Tue, Dec 04, 2018 at 01:24:22PM -0500, Jerome Glisse wrote:
> > Fast forward 2020 and you have this new type of memory that is not cache
> > coherent and you want to expose this to userspace through HMS. What you
> > do is a kernel patch that introduce the v2 type for target and define a
> > set of new sysfs file to describe what v2 is. On this new computer you
> > report your usual main memory as v1 and your new memory as v2.
> > 
> > So the application that only knew about v1 will keep using any v1 memory
> > on your new platform but it will not use any of the new memory v2 which
> > is what you want to happen. You do not have to break existing application
> > while allowing to add new type of memory.
> 
> That seems entirely like the wrong model. We don't want to rewrite every
> application for adding a new memory type.
> 
> Rather there needs to be an abstract way to query memory of specific
> behavior: e.g. cache coherent, size >= xGB, fastest or lowest latency or similar
> 
> Sure there can be a name somewhere, but it should only be used
> for identification purposes, not to hard code in applications.

Discussion with Logan convinced me to use a mask for property like:
    - cache coherent
    - persistent
    ...

Then files for other properties like:
    - bandwidth (bytes/s)
    - latency
    - granularity (size of individual access or bus width)
    ...

> 
> Really you need to define some use cases and describe how your API
> handles them.

I have given examples of how application looks today and how they
transform with HMS in my email exchange with Dave Hansen. I will
add them to the documentation and to the cover letter in my next
posting.

> > > 
> > > It sounds like you're trying to define a system call with built in
> > > ioctl? Is that really a good idea?
> > > 
> > > If you need ioctl you know where to find it.
> > 
> > Well i would like to get thing running in the wild with some guinea pig
> > user to get feedback from end user. It would be easier if i can do this
> > with upstream kernel and not some random branch in my private repo. While
> > doing that i would like to avoid commiting to a syscall upstream. So the
> > way i see around this is doing a driver under staging with an ioctl which
> > will be turn into a syscall once some confidence into the API is gain.
> 
> Ok that's fine I guess.
> 
> But should be a clearly defined ioctl, not an ioctl with redefinable parameters
> (but perhaps I misunderstood your description)
>
> > In the present version i took the other approach of defining just one
> > API that can grow to do more thing. I know the unix way is one simple
> > tool for one simple job. I can switch to the simple call for one action.
> 
> Simple calls are better.

I will switch to one simple call for each individual action (policy and
migration).

> > > > +Current memory policy infrastructure is node oriented, instead of
> > > > +changing that and risking breakage and regression HMS adds a new
> > > > +heterogeneous policy tracking infra-structure. The expectation is
> > > > +that existing application can keep using mbind() and all existing
> > > > +infrastructure under-disturb and unaffected, while new application
> > > > +will use the new API and should avoid mix and matching both (as they
> > > > +can achieve the same thing with the new API).
> > > 
> > > I think we need a stronger motivation to define a completely
> > > parallel and somewhat redundant infrastructure. What breakage
> > > are you worried about?
> > 
> > Some memory expose through HMS is not allocated by regular memory
> > allocator. For instance GPU memory is manage by GPU driver, so when
> > you want to use GPU memory (either as a policy or by migrating to it)
> > you need to use the GPU allocator to allocate that memory. HMS adds
> > a bunch of callback to target structure so that device driver can
> > expose a generic API to core kernel to do such allocation.
> 
> We already have nodes without memory.
> We can also take out nodes out of the normal fall back lists.
> We also have nodes with special memory (e.g. DMA32)
> 
> Nothing you describe here cannot be handled with the existing nodes.

They are have been patchset in the past to exclude node from allocation
last time i check they all were rejected and people felt it was not a
good thing to do.

Also IIRC adding more node might be problematic as i think we do not
have many bits left inside the flags field of struct page. Right now
i do not believe in moving device memory as generic node inside the
linux kernel because for many folks that will just be a waste, people
only doing desktop and not using their GPU for compute will never get
a good usage from that. Graphic memory allocation is wildely different
from compute allocation which is more like CPU one.

So converting graphic driver to register their memory as node does
not seems as a good idea at this time. I doubt the GPU folks upstream
would accept that (with my GPU hat ons i would not).

> > > The obvious alternative would of course be to add some extra
> > > enumeration to the existing nodes.
> > 
> > We can not extend NUMA node to expose GPU memory. GPU memory on
> > current AMD and Intel platform is not cache coherent and thus
> > should not be use for random memory allocation. It should really
> 
> Sure you don't expose it as normal memory, but it can be still
> tied to a node. In fact you have to for the existing topology
> interface to work.

The existing topology interface is not use today for that memory
and people in GPU world do not see it as an interface that can be
use. See above discussion about GPU memory. This is the raison
d'être of this proposal. A new way to expose heterogeneous memory
to userspace.

> > copy and rebuild their data structure inside the new memory. When
> > you move over thing like tree or any complex data structure you have
> > to rebuilt it ie redo the pointers link between the nodes of your
> > data structure.
> > 
> > This is highly error prone complex and wasteful (you have to burn
> > CPU cycles to do that). Now if you can use the same address space
> > as all the other memory allocation in your program and move data
> > around from one device to another with a common API that works on
> > all the various devices, you are eliminating that complex step and
> > making the end user life much easier.
> > 
> > So i am doing this to help existing users by addressing an issues
> > that is becoming harder and harder to solve for userspace. My end
> > game is to blur the boundary between CPU and device like GPU, FPGA,
> 
> This is just high level rationale. You already had that ...
> 
> What I was looking for is how applications actually use the 
> API.
> 
> e.g. 
> 
> 1. Compute application is looking for fast cache coherent memory 
> for CPU usage.
> 
> What does it query and how does it decide and how does it allocate?

Application have an OpenCL context from the context it gets the
device initiator unique id from the device initiator unique id
it looks at all the links and bridge the initiator is connected
to. Which gives it a list of links it can order that list using
bandwidth first and latency second (ie 2 link with same bandwidth
will be order with the one with slowest latency first). It goes
over that list from best to worse and for each links it looks at
what target are also connected to that link. From that it build
an ordered list of targets. It also only pick cache coherent
memory in that list.

It now use this ordered list of targets to set policy or migrate
its buffer to the best memory. Kernel will first try to use the
first target, if it runs out of that memory it will use the next
target ... so on and so forth.

This can all be down inside a userspace common helper library for
ease of use. More advance application will do finer allocation
for instance they will partition their dataset using the access
frequency. Most accessed dataset in the application will use
the fastest memory (which is likely to be somewhat small ie
few GigaBytes), while dataset that are more sparsely accessed
will be push to use slower memory (but they are more of it).

> 2. Allocator in OpenCL application is looking for memory to share
> with OpenCL. How does it find memory?

Same process as above, starts from initiator id, build links
list then build all target that initiator can access. Then
order that list according to the property of interest to the
application (bandwidth, latency, ...). Once it has the target
list it can use either policy or migration. Policy if it is
for a new allocation, migration if it is to migrate an existing
buffer to memory that is more appropriate for the OpenCL device
under use.

> 3. Storage application is looking for larger but slower memory
> for CPU usage.

Application build a list of initiator corresponding to the CPU
it is using (bind too). From that list of initiator it builds
a list of links (considering bridge too). From the list of links
it builds a list of target (connected to those links).

Then it order the list of target by size (not by latency or
bandwidth). Once it has an ordered list of target then it use
either the policy or migrate API for the range of virtual
address it wants to affect.

> 
> 4. ...
> 
> Please work out some use cases like this.

Note that above all the list building in userspace is intended
to be done by an helper library as this is really boiler plate
code. The last patch in my serie have userspace helpers to
parse the sysfs, i will grow that into a mini library with example
to show case it.

More example from other part of this email thread:

High level overview of how one application looks today:

    1) Application get some dataset from some source (disk, network,
       sensors, ...)
    2) Application allocate memory on device A and copy over the dataset
    3) Application run some CPU code to format the copy of the dataset
       inside device A memory (rebuild pointers inside the dataset,
       this can represent millions and millions of operations)
    4) Application run code on device A that use the dataset
    5) Application allocate memory on device B and copy over result
       from device A
    6) Application run some CPU code to format the copy of the dataset
       inside device B (rebuild pointers inside the dataset,
       this can represent millions and millions of operations)
    7) Application run code on device B that use the dataset
    8) Application copy result over from device B and keep on doing its
       thing

How it looks with HMS:
    1) Application get some dataset from some source (disk, network,
       sensors, ...)
    2-3) Application calls HMS to migrate to device A memory
    4) Application run code on device A that use the dataset
    5-6) Application calls HMS to migrate to device B memory
    7) Application run code on device B that use the dataset
    8) Application calls HMS to migrate result to main memory

So we now avoid explicit copy and having to rebuild data structure
inside each device address space.

Above example is for migrate. Here is an example for how the
topology is use today:

    Application knows that the platform is running on have 16
    GPU split into 2 group of 8 GPUs each. GPU in each group can
    access each other memory with dedicated mesh links between
    each others. Full speed no traffic bottleneck.

    Application splits its GPU computation in 2 so that each
    partition runs on a group of interconnected GPU allowing
    them to share the dataset.

With HMS:
    Application can query the kernel to discover the topology of
    system it is running on and use it to partition and balance
    its workload accordingly. Same application should now be able
    to run on new platform without having to adapt it to it.

Cheers,
Jérôme