On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote: > On 12/6/18 11:20 AM, Jerome Glisse wrote: > >>> For case 1 you can pre-parse stuff but this can be done by helper library > >> How would that work? Would each user/container/whatever do this once? > >> Where would they keep the pre-parsed stuff? How do they manage their > >> cache if the topology changes? > > Short answer i don't expect a cache, i expect that each program will have > > a init function that query the topology and update the application codes > > accordingly. > > My concern with having folks do per-program parsing, *and* having a huge > amount of data to parse makes it unusable. The largest systems will > literally have hundreds of thousands of objects in /sysfs, even in a > single directory. That makes readdir() basically impossible, and makes > even open() (if you already know the path you want somehow) hard to do fast. > > I just don't think sysfs (or any filesystem, really) can scale to > express large, complicated topologies in a way that any normal program > can practically parse it. > > My suspicion is that we're going to need to have the kernel parse and > cache these things. We *might* have the data available in sysfs, but we > can't reasonably expect anyone to go parsing it. What i am failing to explain is that kernel can not parse because kernel does not know what the application cares about and every single applications will make different choices and thus select differents devices and memory. It is not even gonna a thing like class A of application will do X and class B will do Y. Every single application in class A might do something different because somes care about the little details. So any kind of pre-parsing in the kernel is defeated by the fact that the kernel does not know what the application is looking for. I do not see anyway to express the application logic in something that can be some kind of automaton or regular expression. The application can litteraly intro-inspect itself and the topology to partition its workload. The topology and device selection is expected to be thousands of line of code in the most advance application. Even worse inside one same application, they might be different device partition and memory selection for different function in the application. I am not scare about the anount of data to parse really, even on big node it is gonna be few dozens of links and bridges, and few dozens of devices. So we are talking hundred directories to parse and read. Maybe an example will help. Let say we have an application with the following pipeline: inA -> functionA -> outA -> functionB -> outB -> functionC -> result - inA 8 gigabytes - outA 8 gigabytes - outB one dword - result something small - functionA is doing heavy computation on inA (several thousands of instructions for each dword in inA). - functionB is doing heavy computation for each dword in outA (again thousand of instruction for each dword) and it is looking for a specific result that it knows will be unique among all the dword computation ie it is output only one dword in outB - functionC is something well suited for CPU that take outB and turns it into the final result Now let see few different system and their topologies: [T2] 1 GPU with 16GB of memory and a handfull of CPU cores [T1] 1 GPU with 8GB of memory and a handfull of CPU cores [T3] 2 GPU with 8GB of memory and a handfull of CPU core [T4] 2 GPU with 8GB of memory and a handfull of CPU core the 2 GPU have a very fast link between each others (400GBytes/s) Now let see how the program will partition itself for each topology: [T1] Application partition its computation in 3 phases: P1: - migrate inA to GPU memory P2: - execute functionA on inA producing outA P3 - execute functionB on outA producing outB - run functionC and see if functionB have found the thing and written it to outB if so then kill all GPU threads and return the result we are done [T2] Application partition its computation in 5 phases: P1: - migrate first 4GB of inA to GPU memory P2: - execute functionA for the 4GB and write the 4GB outA result to the GPU memory P3: - execute functionB for the first 4GB of outA - while functionB is running DMA in the background the the second 4GB of inA to the GPU memory - once one of the millions of thread running functionB find the result it is looking for it writes it to outB which is in main memory - run functionC and see if functionB have found the thing and written it to outB if so then kill all GPU thread and DMA and return the result we are done P4: - run functionA on the second half of inA ie we did not find the result in the first half so we no process the second half that have been migrated to the GPU memory in the background (see above) P5: - run functionB on the second 4GB of outA like above - run functionC on CPU and kill everything as soon as one of the thread running functionB has found the result - return the result [T3] Application partition its computation in 3 phases: P1: - migrate first 4GB of inA to GPU1 memory - migrate last 4GB of inA to GPU2 memory P2: - execute functionA on GPU1 on the first 4GB -> outA - execute functionA on GPU2 on the last 4GB -> outA P3: - execute functionB on GPU1 on the first 4GB of outA - execute functionB on GPU2 on the last 4GB of outA - run functionC and see if functionB running on GPU1 and GPU2 have found the thing and written it to outB if so then kill all GPU threads and return the result we are done [T4] Application partition its computation in 2 phases: P1: - migrate 8GB of inA to GPU1 memory - allocate 8GB for outA in GPU2 memory P2: - execute functionA on GPU1 on the inA 8GB and write out result to GPU2 through the fast link - execute functionB on GPU2 and look over each thread on functionB on outA (busy running even if outA is not valid for each thread running functionB) - run functionC and see if functionB running on GPU2 have found the thing and written it to outB if so then kill all GPU threads and return the result we are done So this is widely different partition that all depends on the topology and how accelerator are inter-connected and how much memory they have. This is a relatively simple example, they are people out there spending month on designing adaptive partitioning algorithm for their application. Cheers, Jérôme