Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

Jerome Glisse <jglisse@xxxxxxxxxx> · Thu, 6 Dec 2018 15:27:06 -0500

On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote:
> On 12/6/18 11:20 AM, Jerome Glisse wrote:
> >>> For case 1 you can pre-parse stuff but this can be done by helper library
> >> How would that work?  Would each user/container/whatever do this once?
> >> Where would they keep the pre-parsed stuff?  How do they manage their
> >> cache if the topology changes?
> > Short answer i don't expect a cache, i expect that each program will have
> > a init function that query the topology and update the application codes
> > accordingly.
> 
> My concern with having folks do per-program parsing, *and* having a huge
> amount of data to parse makes it unusable.  The largest systems will
> literally have hundreds of thousands of objects in /sysfs, even in a
> single directory.  That makes readdir() basically impossible, and makes
> even open() (if you already know the path you want somehow) hard to do fast.
> 
> I just don't think sysfs (or any filesystem, really) can scale to
> express large, complicated topologies in a way that any normal program
> can practically parse it.
> 
> My suspicion is that we're going to need to have the kernel parse and
> cache these things.  We *might* have the data available in sysfs, but we
> can't reasonably expect anyone to go parsing it.

What I am failing to explain is that the kernel cannot parse this because
the kernel does not know what the application cares about. Every single
application will make different choices and thus select different devices
and memory.

It is not even going to be a thing like class A of applications will do X
and class B will do Y. Every single application in class A might do
something different because some care about the little details.

So any kind of pre-parsing in the kernel is defeated by the fact that the
kernel does not know what the application is looking for.

I do not see any way to express the application logic in something that
can be some kind of automaton or regular expression. The application can
literally introspect itself and the topology to partition its workload.
The topology and device selection is expected to be thousands of lines of
code in the most advanced applications.

Even worse, inside one and the same application there might be different
device partitions and memory selections for different functions in the
application.

I am not really scared about the amount of data to parse. Even on a big
node it is going to be a few dozen links and bridges, and a few dozen
devices. So we are talking about a hundred directories to parse and read.
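
To give an idea of what such an init-time scan costs, here is a minimal
Python sketch. It is only an illustration: the name scan_topology and the
layout (one directory per object, each holding small text attribute files)
are assumptions for the example, not the actual layout of this patchset.

```python
import os

def scan_topology(root):
    """Walk a sysfs-like tree once and return {relative_dir: {attr: value}}.

    Hypothetical layout: one directory per object (device, link, bridge,
    memory), each holding small text attribute files.
    """
    objects = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        attrs = {}
        for name in filenames:
            try:
                with open(os.path.join(dirpath, name)) as f:
                    attrs[name] = f.read().strip()
            except (OSError, UnicodeDecodeError):
                # Some attributes may be unreadable or binary; skip them.
                continue
        if attrs:
            objects[os.path.relpath(dirpath, root)] = attrs
    return objects
```

With about a hundred directories this is one walk at program start, after
which the application works from the in-memory dictionary.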

Maybe an example will help. Let's say we have an application with the
following pipeline:

    inA -> functionA -> outA -> functionB -> outB -> functionC -> result

    - inA 8 gigabytes
    - outA 8 gigabytes
    - outB one dword
    - result something small
    - functionA is doing heavy computation on inA (several thousand
      instructions for each dword in inA).
    - functionB is doing heavy computation for each dword in outA (again
      thousands of instructions for each dword) and it is looking for a
      specific result that it knows will be unique among all the dword
      computations, i.e. it outputs only one dword in outB
    - functionC is something well suited for the CPU that takes outB and
      turns it into the final result

Now let's see a few different systems and their topologies:
    [T1] 1 GPU with 16GB of memory and a handful of CPU cores
    [T2] 1 GPU with 8GB of memory and a handful of CPU cores
    [T3] 2 GPUs with 8GB of memory each and a handful of CPU cores
    [T4] 2 GPUs with 8GB of memory each and a handful of CPU cores;
         the 2 GPUs have a very fast link between each other
         (400GBytes/s)

Now let's see how the program will partition itself for each topology:
    [T1] Application partitions its computation in 3 phases:
            P1: - migrate inA to GPU memory
            P2: - execute functionA on inA producing outA
            P3: - execute functionB on outA producing outB
                - run functionC and see if functionB has found the
                  thing and written it to outB; if so then kill all
                  GPU threads and return the result, we are done

    [T2] Application partitions its computation in 5 phases:
            P1: - migrate first 4GB of inA to GPU memory
            P2: - execute functionA for the 4GB and write the 4GB
                  outA result to the GPU memory
            P3: - execute functionB for the first 4GB of outA
                - while functionB is running, DMA in the background
                  the second 4GB of inA to the GPU memory
                - once one of the millions of threads running functionB
                  finds the result it is looking for, it writes it to
                  outB which is in main memory
                - run functionC and see if functionB has found the
                  thing and written it to outB; if so then kill all
                  GPU threads and DMA and return the result, we are
                  done
            P4: - run functionA on the second half of inA, i.e. we did
                  not find the result in the first half so we now
                  process the second half that has been migrated to
                  the GPU memory in the background (see above)
            P5: - run functionB on the second 4GB of outA like
                  above
                - run functionC on CPU and kill everything as soon
                  as one of the threads running functionB has found
                  the result
                - return the result
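
The control flow of the [T2] plan can be sketched as below. This is a
simplified, sequential Python illustration: function_a and function_b are
trivial stand-ins for the real GPU kernels, run_in_halves is a made-up name,
and the background DMA overlap is not modeled. Only the "process one half,
check, stop early" logic is shown.

```python
def function_a(chunk):
    # Stand-in for the heavy per-dword computation of functionA.
    return [x * 2 for x in chunk]

def function_b(chunk, target):
    # Stand-in for functionB: return the unique matching value, or None.
    for x in chunk:
        if x == target:
            return x
    return None

def run_in_halves(in_a, target):
    """Process in_a one half at a time and stop at the first hit.

    Returns (result, halves_processed).
    """
    mid = len(in_a) // 2
    halves = [in_a[:mid], in_a[mid:]]
    for i, half in enumerate(halves, start=1):
        out_a = function_a(half)            # P2 / P4
        out_b = function_b(out_a, target)   # P3 / P5
        if out_b is not None:
            return out_b, i                 # functionC would finish here
    return None, len(halves)
```

If the result is in the first half, the second half is never computed,
which is the whole point of splitting the work this way on an 8GB GPU.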

    [T3] Application partitions its computation in 3 phases:
            P1: - migrate first 4GB of inA to GPU1 memory
                - migrate last 4GB of inA to GPU2 memory
            P2: - execute functionA on GPU1 on the first 4GB -> outA
                - execute functionA on GPU2 on the last 4GB -> outA
            P3: - execute functionB on GPU1 on the first 4GB of outA
                - execute functionB on GPU2 on the last 4GB of outA
                - run functionC and see if functionB running on GPU1
                  or GPU2 has found the thing and written it to outB;
                  if so then kill all GPU threads and return the result,
                  we are done

    [T4] Application partitions its computation in 2 phases:
            P1: - migrate 8GB of inA to GPU1 memory
                - allocate 8GB for outA in GPU2 memory
            P2: - execute functionA on GPU1 on the inA 8GB and write
                  out result to GPU2 through the fast link
                - execute functionB on GPU2, with each thread of
                  functionB polling outA (busy running even while
                  its part of outA is not yet valid)
                - run functionC and see if functionB running on GPU2
                  has found the thing and written it to outB; if so
                  then kill all GPU threads and return the result,
                  we are done

So these are widely different partitions that all depend on the topology,
on how the accelerators are inter-connected and on how much memory they
have. This is a relatively simple example; there are people out there
spending months on designing adaptive partitioning algorithms for their
applications.
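
For this toy pipeline, the selection logic the application runs at init
could look like the Python sketch below. The Topology fields and the
choose_plan function are hypothetical names for illustration, and the
thresholds only encode the four cases above. A real application would also
weigh bandwidth, latency and much more.

```python
from dataclasses import dataclass

GB = 1 << 30

@dataclass
class Topology:
    gpus: int                # number of GPUs
    gpu_mem: int             # bytes of memory per GPU
    fast_link: bool = False  # very fast GPU-to-GPU link present

def choose_plan(topo, in_size=8 * GB, out_size=8 * GB):
    """Return which of the example plans (T1..T4) fits the topology."""
    half = (in_size + out_size) // 2
    if topo.gpus >= 2 and topo.fast_link and topo.gpu_mem >= max(in_size, out_size):
        return "T4"  # functionA on GPU1 streams outA to functionB on GPU2
    if topo.gpus >= 2 and topo.gpu_mem >= half:
        return "T3"  # each GPU handles one half of inA and outA
    if topo.gpus == 1 and topo.gpu_mem >= in_size + out_size:
        return "T1"  # inA and outA both fit on the single GPU
    if topo.gpus == 1 and topo.gpu_mem >= half:
        return "T2"  # two 4GB passes with background DMA
    return "fallback"  # none of the example plans applies
```

The kernel cannot make this choice for the application, because the
branches depend on the sizes and access patterns of this particular
pipeline.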

Cheers,
Jérôme