On Tue, Dec 04, 2018 at 10:02:55AM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jglisse@xxxxxxxxxx wrote:
> > This means that it is no longer sufficient to consider a flat view
> > for each node in a system but for maximum performance we need to
> > account for all of this new memory but also for system topology.
> > This is why this proposal is unlike the HMAT proposal [1] which
> > tries to extend the existing NUMA for new type of memory. Here we
> > are tackling a much more profound change that depart from NUMA.
>
> The HMAT and its implications exist, in firmware, whether or not we do
> *anything* in Linux to support it or not. Any system with an HMAT
> inherently reflects the new topology, via proximity domains, whether or
> not we parse the HMAT table in Linux or not.
>
> Basically, *ACPI* has decided to extend NUMA. Linux can either fight
> that or embrace it. Keith's HMAT patches are embracing it. These
> patches are appearing to fight it. Agree? Disagree?

Disagree. Sorry if it felt that way; that was not my intention. The
ACPI HMAT information can be used to populate the HMS file system
representation. My intention is not to fight Keith's HMAT patches;
they are useful on their own. But I do not see how to evolve NUMA to
support device memory, so while Keith is taking a step in the
direction I want, I do not see how to cross to the place I need to
be. More on that below.

> Also, could you add a simple, example program for how someone might use
> this? I got lost in all the new sysfs and ioctl gunk. Can you
> characterize how this would work with the *exiting* NUMA interfaces that
> we have?

That is the issue: I can not expose device memory as a NUMA node
because device memory is not cache coherent on AMD and Intel platforms
today. Moreover, in some cases that memory is not visible at all to
the CPU, which is not something you can express in the current NUMA
node model.
Here is an abbreviated list of features I need to support:
 - device private memory (not accessible by the CPU or anybody else)
 - non-coherent memory (PCIe is not cache coherent for CPU access)
 - multiple paths to access the same memory, either:
   - multiple _different_ physical addresses aliasing to the same
     memory
   - device blocks can select which path they take to access some
     memory (it is not inside the page table but in how you program
     the device block)
 - complex topology that is not a tree, where a device link can have
   better characteristics than the CPU inter-connect between the
   nodes. There are existing users today that use topology information
   to partition their workload (HPC folks who have a fixed platform).
 - device memory needs to stay under device driver control, as some
   existing APIs (OpenGL, Vulkan) have different memory models, and if
   we want the device to be used for those too then we need to keep
   the device driver in control of the device memory allocation

There is an example userspace program with the last patch in the
series. But here is a high level overview of how one application looks
today:
    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2) Application allocates memory on device A and copies over the
       dataset
    3) Application runs some CPU code to format the copy of the
       dataset inside device A memory (rebuild pointers inside the
       dataset; this can represent millions and millions of
       operations)
    4) Application runs code on device A that uses the dataset
    5) Application allocates memory on device B and copies over the
       result from device A
    6) Application runs some CPU code to format the copy of the
       dataset inside device B (rebuild pointers inside the dataset;
       this can represent millions and millions of operations)
    7) Application runs code on device B that uses the dataset
    8) Application copies the result over from device B and keeps on
       doing its thing

How it looks with HMS:
    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2-3) Application calls HMS to migrate to device A memory
    4) Application runs code on device A that uses the dataset
    5-6) Application calls HMS to migrate to device B memory
    7) Application runs code on device B that uses the dataset
    8) Application calls HMS to migrate the result to main memory

So we now avoid the explicit copies and having to rebuild the data
structures inside each device address space.

The above example is for migrate. Here is an example of how topology
is used today:

    Application knows that the platform it is running on has 16 GPUs
    split into 2 groups of 8 GPUs each. GPUs in each group can access
    each other's memory over dedicated mesh links between them, at
    full speed with no traffic bottleneck. The application splits its
    GPU computation in 2 so that each partition runs on a group of
    interconnected GPUs, allowing them to share the dataset.

With HMS:

    Application can query the kernel to discover the topology of the
    system it is running on and use it to partition and balance its
    workload accordingly. The same application should now be able to
    run on a new platform without having to be adapted to it.

This is kind of naive; I expect topology to be hard to use, but maybe
it is just me being pessimistic. In any case, today we have a chicken
and egg problem. We do not have a standard way to expose topology, so
programs that can leverage topology are only written for HPC, where
the platform is standard for a few years. If we had a standard way to
expose the topology then maybe we would see more programs using it. At
the very least we could convert existing users.

Policy is the same kind of story. This email is long enough now :) But
I can write one down if you want.

Cheers,
Jérôme
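P.S. To make the topology-partitioning idea above concrete, here is a
small sketch. This is not the HMS interface; the graph representation
and the function name are made up for illustration. It assumes the
application has already read the topology (however it is exposed) into
a list of fast links between GPUs, and simply groups GPUs into islands
of mutually connected devices, one workload partition per island, as
in the 16-GPU example:

```python
# Hypothetical sketch (not the HMS API; names and data layout are
# invented): group GPUs into islands of mutually reachable devices
# over dedicated fast links, so the workload can be split one
# partition per island.

def partition_by_links(gpus, links):
    """Return the connected components of the GPU fast-link graph.

    gpus  -- iterable of GPU ids
    links -- iterable of (gpu_a, gpu_b) pairs with a dedicated link
    """
    neighbors = {g: set() for g in gpus}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen, groups = set(), []
    for g in gpus:
        if g in seen:
            continue
        # Walk the fast-link graph starting from g.
        group, todo = set(), [g]
        while todo:
            n = todo.pop()
            if n in group:
                continue
            group.add(n)
            todo.extend(neighbors[n] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# 16 GPUs, 2 groups of 8, full mesh links inside each group.
gpus = range(16)
links = [(a, b) for base in (0, 8)
         for a in range(base, base + 8)
         for b in range(a + 1, base + 8)]
print(partition_by_links(gpus, links))
# -> [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]
```

The point of HMS here is only to let the application discover `links`
in a standard way instead of hard-coding it per platform; the
partitioning logic itself stays in the application.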