On Thu, Dec 06, 2018 at 03:27:06PM -0500, Jerome Glisse wrote: > On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote: > > On 12/6/18 11:20 AM, Jerome Glisse wrote: > > >>> For case 1 you can pre-parse stuff but this can be done by helper library > > >> How would that work? Would each user/container/whatever do this once? > > >> Where would they keep the pre-parsed stuff? How do they manage their > > >> cache if the topology changes? > > > Short answer i don't expect a cache, i expect that each program will have > > > a init function that query the topology and update the application codes > > > accordingly. > > > > My concern with having folks do per-program parsing, *and* having a huge > > amount of data to parse makes it unusable. The largest systems will > > literally have hundreds of thousands of objects in /sysfs, even in a > > single directory. That makes readdir() basically impossible, and makes > > even open() (if you already know the path you want somehow) hard to do fast. > > > > I just don't think sysfs (or any filesystem, really) can scale to > > express large, complicated topologies in a way that any normal program > > can practically parse it. > > > > My suspicion is that we're going to need to have the kernel parse and > > cache these things. We *might* have the data available in sysfs, but we > > can't reasonably expect anyone to go parsing it. > > What i am failing to explain is that kernel can not parse because kernel > does not know what the application cares about and every single applications > will make different choices and thus select differents devices and memory. > > It is not even gonna a thing like class A of application will do X and > class B will do Y. Every single application in class A might do something > different because somes care about the little details. > > So any kind of pre-parsing in the kernel is defeated by the fact that the > kernel does not know what the application is looking for. > > I do not see anyway to express the application logic in something that > can be some kind of automaton or regular expression. The application can > litteraly intro-inspect itself and the topology to partition its workload. > The topology and device selection is expected to be thousands of line of > code in the most advance application. > > Even worse inside one same application, they might be different device > partition and memory selection for different function in the application. > > > I am not scare about the anount of data to parse really, even on big node > it is gonna be few dozens of links and bridges, and few dozens of devices. > So we are talking hundred directories to parse and read. > > > Maybe an example will help. Let say we have an application with the > following pipeline: > > inA -> functionA -> outA -> functionB -> outB -> functionC -> result > > - inA 8 gigabytes > - outA 8 gigabytes > - outB one dword > - result something small > - functionA is doing heavy computation on inA (several thousands of > instructions for each dword in inA). > - functionB is doing heavy computation for each dword in outA (again > thousand of instruction for each dword) and it is looking for a > specific result that it knows will be unique among all the dword > computation ie it is output only one dword in outB > - functionC is something well suited for CPU that take outB and turns > it into the final result > > Now let see few different system and their topologies: > [T2] 1 GPU with 16GB of memory and a handfull of CPU cores > [T1] 1 GPU with 8GB of memory and a handfull of CPU cores > [T3] 2 GPU with 8GB of memory and a handfull of CPU core > [T4] 2 GPU with 8GB of memory and a handfull of CPU core > the 2 GPU have a very fast link between each others > (400GBytes/s) > > Now let see how the program will partition itself for each topology: > [T1] Application partition its computation in 3 phases: > P1: - migrate inA to GPU memory > P2: - execute functionA on inA producing outA > P3 - execute functionB on outA producing outB > - run functionC and see if functionB have found the > thing and written it to outB if so then kill all > GPU threads and return the result we are done > > [T2] Application partition its computation in 5 phases: > P1: - migrate first 4GB of inA to GPU memory > P2: - execute functionA for the 4GB and write the 4GB > outA result to the GPU memory > P3: - execute functionB for the first 4GB of outA > - while functionB is running DMA in the background > the the second 4GB of inA to the GPU memory > - once one of the millions of thread running functionB > find the result it is looking for it writes it to > outB which is in main memory > - run functionC and see if functionB have found the > thing and written it to outB if so then kill all > GPU thread and DMA and return the result we are > done > P4: - run functionA on the second half of inA ie we did > not find the result in the first half so we no > process the second half that have been migrated to > the GPU memory in the background (see above) > P5: - run functionB on the second 4GB of outA like > above > - run functionC on CPU and kill everything as soon > as one of the thread running functionB has found > the result > - return the result > > [T3] Application partition its computation in 3 phases: > P1: - migrate first 4GB of inA to GPU1 memory > - migrate last 4GB of inA to GPU2 memory > P2: - execute functionA on GPU1 on the first 4GB -> outA > - execute functionA on GPU2 on the last 4GB -> outA > P3: - execute functionB on GPU1 on the first 4GB of outA > - execute functionB on GPU2 on the last 4GB of outA > - run functionC and see if functionB running on GPU1 > and GPU2 have found the thing and written it to outB > if so then kill all GPU threads and return the result > we are done > > [T4] Application partition its computation in 2 phases: > P1: - migrate 8GB of inA to GPU1 memory > - allocate 8GB for outA in GPU2 memory > P2: - execute functionA on GPU1 on the inA 8GB and write > out result to GPU2 through the fast link > - execute functionB on GPU2 and look over each > thread on functionB on outA (busy running even > if outA is not valid for each thread running > functionB) > - run functionC and see if functionB running on GPU2 > have found the thing and written it to outB if so > then kill all GPU threads and return the result > we are done > > > So this is widely different partition that all depends on the topology > and how accelerator are inter-connected and how much memory they have. > This is a relatively simple example, they are people out there spending > month on designing adaptive partitioning algorithm for their application. > And since i am writting example, another funny one let say you have a system with 2 nodes and on each node 2 GPU and one network. On each node the local network adapter can only access one of the 2 GPU memory. All the GPU are conntected to each other through a fully symmetrical mesh inter-connect. Now let say your program has 4 functions back to back, each functions consuming the output of the previous one. Finaly you get your input from the network and stream out the final function output to the network So what you can do is: Node0 Net0 -> write to Node0 GPU0 memory Node0 GPU0 -> run first function and write result to Node0 GPU1 Node0 GPU1 -> run second function and write result to Node1 GPU3 Node1 GPU3 -> run third function and write result to Node1 GPU2 Node1 Net1 -> read result from Node1 GPU2 and stream it out Yes this kind of thing can be decided at application startup during initialization. Idea is that you model your program computation graph each node is a function (or group of functions) and each arrow is data flow (input and output). So you have a graph, now what you do is try to find a sub-graph of your system topology that match this graph and for the system topology you also have to check that each of your program node can run on the specific accelerator node of your system (does the accelerator have the feature X and Y ?) If you are not lucky and that there is no 1 to 1 match the you can can re-arrange/simplify your application computation graph. For instance group multiple of your application function node into just one node to shrink your computation graph. Rinse and repeat. Moreover each application will have multiple separate computation graph and the application will want to spread as evenly as possible its workload and select the most powerfull accelerator for the most intensive computation ... I do not see how to have graph matching API with complex testing where you need to query back userspace library. Like querying if the userspace penCL driver for GPU A support feature X ? Which might not only depend on the device generation or kernel device driver version but also on the version of the userspace driver. I feel it would be a lot easier to provide a graph to userspace and have userspace do this complex matching and adaption of its computation graph and load balance its computation at the same time. Of course not all application will be that complex and like i said i believe average app (especialy desktop app design to run on laptop) will just use a dumb down thing ie they will only use one or two devices at the most. Yes all this is hard but easy problems are not interesting to solve. Cheers, Jérôme