On Thu, 2017-04-20 at 10:29 -0500, Christoph Lameter wrote:
> On Thu, 20 Apr 2017, Balbir Singh wrote:
> > Couple of things are needed
> >
> > 1. Isolation of allocation
>
> cgroups, memory policy and cpuset provide that

Can these be configured appropriately by the accelerator or GPU driver
at the point where it hot plugs the memory? The problem is that we need
to ensure there is no window in which the kernel will start putting
things like skbs etc... in there (there's a sketch of that window at
the end of this mail).

My original idea was to cover the whole thing with a CMA, which helps
with the case where the user wants to use the "legacy" APIs of manually
controlling the allocations on the GPU, since in that case the
user/driver might need to do fairly large contiguous allocations (second
sketch below). I was told there are some plumbing issues with having a
bunch of CMAs around, though.

Basically, the whole debate at the moment revolves around whether to use
HMM/CDM/ZONE_DEVICE vs. making it just a NUMA node with a sprinkle of
added foo.

The former approach pretty clearly puts that device into a separate
category and keeps most of the VM stuff at bay. However, it has a
number of disadvantages. ZONE_DEVICE was meant for providing struct
pages & DAX etc... for things like flash storage, "new memory" etc...

What we have here is effectively a bit more like a NUMA node, whose
processing unit is just not a CPU but a GPU or some kind of
accelerator. The difference boils down to how we want to use it. We
want any page, anonymous memory, mapped file, you name it... to be able
to migrate back and forth depending on which piece of HW is most
actively accessing it.

This is helped by a bunch of things, such as very fast DMA engines to
facilitate migration, and HW counters to detect when parts of that
memory are accessed "remotely" (and thus request migrations).

So the NUMA model fits reasonably well, with that memory being overall
treated normally. The ZONE_DEVICE model on the other hand creates those
"special" pages which require a pile of special casing in all sorts of
places, as Balbir has mentioned, with still a bunch of rather standard
stuff not working with them.

However, we do need to address a few quirks, which is what this is
about. Mostly we want to keep kernel allocations away from it, in part
because the memory is more prone to fail and not terribly fast for
direct CPU access, and in part because we want to maximize its
availability for dedicated applications.

I find it clumsy to require establishing policies from userspace after
the node has been instantiated, and racy (third sketch below). At least
for that isolation mechanism. Other things are possibly more realistic
to do that way, such as taking KSM and AutoNUMA out of the picture for
it.

Cheers,
Ben.
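
P.S. A few sketches to make the above concrete. First, the isolation
window. This is illustrative only: gpu_dev and all gpu_* names are made
up, add_memory() is the real hotplug entry point (exact behavior may
differ between trees):

#include <linux/memory_hotplug.h>	/* add_memory() */

/* Made-up device structure, for illustration only. */
struct gpu_dev {
	int nid;	/* NUMA node id assigned to the device */
	u64 mem_base;	/* physical base of the device memory */
	u64 mem_size;
};

static int gpu_mem_hotplug(struct gpu_dev *gpu)
{
	int ret;

	/* Create the sections / struct pages for the device memory. */
	ret = add_memory(gpu->nid, gpu->mem_base, gpu->mem_size);
	if (ret)
		return ret;

	/*
	 * The memory blocks still have to be onlined, typically by
	 * udev writing "online" or "online_movable" to the block's
	 * sysfs state file. From the moment they are onlined until
	 * userspace has set up cpusets/mempolicies, the allocator is
	 * free to place pages on the new node: movable ones if onlined
	 * movable, and unmovable kernel ones (skbs, slab) too if
	 * onlined normal. That is the window I'm worried about.
	 */
	return 0;
}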
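Second, the CMA idea: cover the whole device memory range with a CMA
area so the kernel only ever puts movable pages there, while the driver
can still carve out large contiguous buffers for the legacy APIs. The
gpu_* names are again made up; the cma_* calls are the real interface,
modulo the exact signatures in a given tree:

#include <linux/cma.h>

static struct cma *gpu_cma;

/* Boot-time: reserve the whole device memory range as a CMA area. */
static int __init gpu_cma_init(phys_addr_t base, phys_addr_t size)
{
	/*
	 * The kernel may still use these pages for movable
	 * allocations, but the driver can always migrate them out to
	 * assemble large physically contiguous buffers.
	 */
	return cma_declare_contiguous(base, size, base + size, 0, 0,
				      true, &gpu_cma);
}

/* Legacy allocation path: grab a big physically contiguous chunk. */
static struct page *gpu_alloc_contig(size_t nr_pages)
{
	return cma_alloc(gpu_cma, nr_pages, 0);
}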
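Third, what the after-the-fact userspace policy looks like. Something
has to run mbind()/set_mempolicy() (or set up a cpuset) once the node
exists, and anything allocated before that ran is already in the wrong
place. A hypothetical example, assuming the driver put the device
memory on node 1:

#include <numaif.h>	/* mbind(); link with -lnuma */
#include <stddef.h>

#define DEVICE_NODE 1	/* assumption: node id the driver assigned */

/* Bind an existing buffer to the device node, moving pages if needed. */
static int place_on_device(void *buf, size_t len)
{
	unsigned long nodemask = 1UL << DEVICE_NODE;

	return mbind(buf, len, MPOL_BIND, &nodemask,
		     sizeof(nodemask) * 8, MPOL_MF_MOVE);
}

And note this only covers the calling process: keeping everybody else's
allocations off the node needs a global mechanism that is in place
before the first page can land there, which is exactly the isolation
problem above.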