RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

zhuweixi <weixi.zhu@xxxxxxxxxx> · Fri, 1 Dec 2023 02:37:21 +0000

From your argument on KVM I can see that the biggest miscommunication between us is that you believed GMEM wanted to share the whole address space. That is not the case. GMEM only provides coordination via certain mmap() calls. So you are raising a case that supports GMEM again: passing through part of the CPU address space, instead of the whole CPU address space, is exactly what GMEM can do. The IOMMU SVA feature, on the other hand, indiscriminately binds the whole address space, since the hardware feature is to directly share the whole CPU page table.

"We really should never ever encourage people to bind their device address space to the CPU address space. This is a very special use case and limits the driver design to only this use case.
We have exercised this approach to a rather extreme degree with KFD and I can clearly say that doing this was a really big mistake.
As far as I can see you are about to repeat that mistake and even encourage others to do so as well."

-- Internally attaching a device context to mm_struct in GMEM is ultimately a different approach to coordinating CPU and devices. I want to replace MMU notifiers with this approach because I want to protect core MM from random interactions with external driver MMs. Both GMEM and MMU notifiers bind device contexts to the CPU context; neither puts them in the same address space. If someone is against GMEM's approach to binding CPU and device contexts, then they should be against MMU notifiers as well.
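
One way to read the comparison above is the toy userspace model below. It is only a sketch, not kernel code, and every name in it (toy_mm, toy_notifier, toy_dev_ctx and the helper functions) is hypothetical and belongs neither to GMEM nor to the MMU notifier API. It shows that in both styles a per-device object hangs off the CPU context and is called back when a range goes away, while the device keeps its own bookkeeping.

```c
/* Toy userspace model, not kernel code. Every name here is hypothetical. */
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_DEV 4

struct toy_range { unsigned long start, end; };  /* half-open [start, end) */

/* Notifier style: the driver registers a callback and reacts on its own. */
struct toy_notifier {
    void (*invalidate)(struct toy_notifier *n, struct toy_range r);
    unsigned long invalidated_bytes;  /* driver-private bookkeeping */
};

/* Attach style: the driver hands over a context with a narrow hardware op. */
struct toy_dev_ctx {
    int (*unmap)(struct toy_dev_ctx *c, struct toy_range r);
    unsigned long unmapped_bytes;     /* driver-private bookkeeping */
};

/* Stand-in for the CPU context: it only holds pointers to device objects. */
struct toy_mm {
    struct toy_notifier *notifiers[TOY_MAX_DEV];
    size_t nr_notifiers;
    struct toy_dev_ctx *ctxs[TOY_MAX_DEV];
    size_t nr_ctxs;
};

int toy_notifier_register(struct toy_mm *mm, struct toy_notifier *n)
{
    assert(mm && n);
    if (mm->nr_notifiers == TOY_MAX_DEV)
        return -1;
    mm->notifiers[mm->nr_notifiers++] = n;
    return 0;
}

int toy_ctx_attach(struct toy_mm *mm, struct toy_dev_ctx *c)
{
    assert(mm && c);
    if (mm->nr_ctxs == TOY_MAX_DEV)
        return -1;
    mm->ctxs[mm->nr_ctxs++] = c;
    return 0;
}

/* The CPU side drops a range and tells every bound device object about it. */
void toy_mm_unmap_range(struct toy_mm *mm, struct toy_range r)
{
    for (size_t i = 0; i < mm->nr_notifiers; i++)
        mm->notifiers[i]->invalidate(mm->notifiers[i], r);
    for (size_t i = 0; i < mm->nr_ctxs; i++)
        mm->ctxs[i]->unmap(mm->ctxs[i], r);
}

/* Example callbacks that merely count how many bytes were dropped. */
void toy_count_invalidate(struct toy_notifier *n, struct toy_range r)
{
    n->invalidated_bytes += r.end - r.start;
}

int toy_count_unmap(struct toy_dev_ctx *c, struct toy_range r)
{
    c->unmapped_bytes += r.end - r.start;
    return 0;
}
```

In this model the two styles differ only in who owns the logic behind the callback, which is the point under debate in this thread; the sketch does not settle which side of that trade-off is preferable.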

Currently, from our discussion I think I received two messages:
	1. The original AMDKFD design was rejected because it inserted vendor-specific code into the generic core MM.
	2. The rejection from #1 led to your opinion that nobody should mix device and core MM together.

I think #1 really encourages me that GMEM could help the AMDKFD driver. However, I am also confused about why GMEM must be compared with a vendor-specific driver. AMDKFD only considered a very special use case: AMD GPUs using the AMD IOMMU.
GMEM, however, tries to cover all generalized cases of memory devices. The device can be an Nvidia GPU or a Huawei NPU that uses its own MMU, an AMD/Intel GPU that uses an IOMMU, or a device from any of the hundreds of other new accelerator vendors.

-Weixi

-----Original Message-----
From: Christian König <christian.koenig@xxxxxxx> 
Sent: Thursday, November 30, 2023 9:05 PM
To: zhuweixi <weixi.zhu@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxx>
Cc: linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx; mgorman@xxxxxxx; jglisse@xxxxxxxxxx; rcampbell@xxxxxxxxxx; jhubbard@xxxxxxxxxx; apopple@xxxxxxxxxx; mhairgrove@xxxxxxxxxx; ziy@xxxxxxxxxx; alexander.deucher@xxxxxxx; Xinhui.Pan@xxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Felix.Kuehling@xxxxxxx; ogabbay@xxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; jgg@xxxxxxxxxx; leonro@xxxxxxxxxx; zhenyuw@xxxxxxxxxxxxxxx; zhi.a.wang@xxxxxxxxx; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; jani.nikula@xxxxxxxxxxxxxxx; joonas.lahtinen@xxxxxxxxxxxxxxx; rodrigo.vivi@xxxxxxxxx; tvrtko.ursulin@xxxxxxxxxxxxxxx; Danilo Krummrich <dakr@xxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Zeng, Oak <oak.zeng@xxxxxxxxx>
Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

Am 30.11.23 um 08:22 schrieb zhuweixi:
> Add @Oak to the KFD discussion. I will reply separately elaborating your questions on GMEM's difference from HMM/MMU notifiers.
>
> Christian, thanks for pointing me to that AMDKFD discussion. I have read the discussion around the AMDKFD skeleton patch and found the previous discussion in the following URLs:
> https://lore.kernel.org/dri-devel/1405028848-5660-1-git-send-email-oded.gabbay@xxxxxxx/#r
> https://lore.kernel.org/dri-devel/20140711154231.GB1870@xxxxxxxxx/
>
> I believe AMDKFD's original patch was rejected mostly because it inserted vendor-specific code into the generic core MM. Jérôme clearly stated this issue in the second URL. If the code is vendor-specific then it has no place in core MM, period.
>
> But why does that vendor-specific solution relate to a generalized solution like GMEM? The initial AMDKFD patch doesn't work for Nvidia or Intel.

KFD was meant to be a vendor agnostic framework, very similar to what you propose here.

It's just that it was seen as vendor specific because nobody else actually wanted to design their drivers this way.

>
> In fact I think the rejection of the initial AMDKFD patch supports GMEM's idea -- there could have been a simpler AMDKFD implementation if the core MM was extended by GMEM. Also, 9 years later, many more companies are building their own accelerators, especially now that the GPT family has become such a success. Don't we want to advance Linux's core MM to give friendlier and more generalized support to the upcoming new vendors?

Well exactly that's the big point: Absolutely not!

We really should never ever encourage people to bind their device address space to the CPU address space. This is a very special use case and limits the driver design to only this use case.

We have exercised this approach to a rather extreme degree with KFD and I can clearly say that doing this was a really big mistake.

As far as I can see you are about to repeat that mistake and even encourage others to do so as well.

> Now answering Christian's design concerns:
>
> 1. "There are cases that do not want to share CPU address space"
> Maybe, but I am not fully convinced. The current case we can find is when a NIC utilizes IOMMU for security. For this case, GMEM implemented generalized VMA support and tested it with NICs using both Intel-IOMMU/Arm-SMMU. This cut 600 LoC of IOVA management code from the IOMMU driver, but it is still not included in this RFC patch -- I cannot find other cases demanding this isolation. The isolation is also unnecessary -- the NIC can enable the IOMMU SVM feature to share the CPU address space. As for KVM, it is essentially a host process that utilizes two different MMUs within the same address space, so it fits GMEM's design...

Maybe I don't completely follow here how you want to save LoC for the IOMMU implementation of NICs, but at least for the PASID/PRI support AMD just recently went in exactly the opposite direction:

commit 5a0b11a180a9b82b4437a4be1cf73530053f139b
Author: Vasant Hegde <vasant.hegde@xxxxxxx>
Date:   Fri Oct 6 09:57:02 2023 +0000

     iommu/amd: Remove iommu_v2 module

     AMD GPU driver which was the only in-kernel user of iommu_v2 module
     removed dependency on iommu_v2 module.

     Also we are working on adding SVA support in AMD IOMMU driver. Device
     drivers are expected to use common SVA framework to enable device
     PASID/PRI features.

     Removing iommu_v2 module and then adding SVA simplifies the development.
     Hence remove iommu_v2 module.

As I wrote before this IOMMU V2 driver was basically binding the CPU address space to IOMMU devices using the PASID. For an example see function amd_iommu_bind_pasid().

This turned out to be not as useful as we hoped it would be. Essentially the use cases where you want to give a device access to the whole address space of a process are extremely limited. That's why we are removing it and switching over to a separate SVA implementation which doesn't depend on the CPU address space.

But the virtualization use case I mentioned is completely independent of IOMMU. In KVM/XEN/etc. there is a functionality called native context. Basically this means that instead of passing through a complete device isolated by IOMMU, only specific kernel functionalities are exposed to the guest operating system through QEMU.

See here for an example how OpenGL is implemented on top of this: 
https://docs.mesa3d.org/drivers/virgl.html

This is actually using the separation between device memory management and CPU memory management and is basically a killer argument why those two topics should be separated. Otherwise it's impossible for QEMU to actually handle multiple independent device memory address spaces inside a single CPU memory address space.
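
As a rough illustration of that last point, the toy userspace model below keeps several independent device address spaces inside one process, the way a QEMU-like process would for its guests. It is only a sketch under stated assumptions: all names (toy_dev_as, toy_process and the helpers) are hypothetical, and a page-indexed array stands in for real device page tables.

```c
/* Toy userspace model, not kernel or QEMU code. All names are hypothetical. */
#include <assert.h>
#include <stddef.h>

#define TOY_PAGES    8
#define TOY_MAX_AS   3
#define TOY_UNMAPPED (-1)

/* One device address space: device page number -> backing object id. */
struct toy_dev_as {
    int backing[TOY_PAGES];
};

/* One CPU process that owns several independent device address spaces. */
struct toy_process {
    struct toy_dev_as guest_as[TOY_MAX_AS];
};

void toy_process_init(struct toy_process *p)
{
    assert(p);
    for (size_t a = 0; a < TOY_MAX_AS; a++)
        for (size_t i = 0; i < TOY_PAGES; i++)
            p->guest_as[a].backing[i] = TOY_UNMAPPED;
}

/* Map a device page in one address space; the others are not touched. */
int toy_dev_as_map(struct toy_dev_as *as, size_t page, int backing_id)
{
    assert(as);
    if (page >= TOY_PAGES || backing_id < 0)
        return -1;
    as->backing[page] = backing_id;
    return 0;
}

/* Translate a device page; returns TOY_UNMAPPED if nothing is mapped. */
int toy_dev_as_lookup(const struct toy_dev_as *as, size_t page)
{
    assert(as);
    if (page >= TOY_PAGES)
        return TOY_UNMAPPED;
    return as->backing[page];
}
```

The same device page number can resolve to different backing objects in each address space, which would not be expressible if every device address space had to mirror the single CPU address space of the process.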

> 2. "This does not integrate well with the filesystem layer in Linux..."
> To be honest, not using a logical page table for anonymous memory is why Linux THP fails compared with FreeBSD's superpage, but I am not going to elaborate on it here. But yes, and I am looking into merging struct vm_object->logical_page_table with struct address_space->i_pages. This will naturally support devices oversubscribing both host DRAM and disks. As explained in my cover letter, struct vm_object borrows FreeBSD's VM design -- it provides a unified abstraction layer for anonymous memory, file-backed memory, etc.

I'm not that deep into this stuff, so leaving this to the experts on FreeBSD.

> 3. "Requirements to CPU address space management and device address space management are just massively different. For example huge and giant pages are a must have for modern devices..."
> I think you are asking two questions. First, is VA space a problem?

No, this is about something completely different.

> GMEM assumes that device VA space should be covered by CPU VA space (sorry i386), ...
[SNIP]

I'm removing this because you were talking about something different than what I meant.

I will try to explain the background on an example outside of machine learning and compute since this framework should be applicable to every use case and not be limited to those. Otherwise Linux would sooner or later just be applicable to only those use cases.

So let's take a look at how modern games use a GPU for example. On startup a rather large part of the GPU address space is allocated, for example 64GiB. Then the necessary resources (images, textures, vertices, shaders etc.) are loaded into separate buffer objects.

Those resources are then mapped into the allocated address range on a page by page basis. So you basically don't have large VMAs which cover one resource, but rather the page tables are used as a remapping table into the available resources. This increases the number of virtual mappings drastically; it's kind of comparable to how an anon_vma works inside a VMA on Linux.

Those mappings also are not set up at start and then used throughout the whole lifetime of the process, but rather done very dynamically, sometimes resulting in thousands of mapping operations per second.

In addition, devices have page table features which CPUs don't have. These range from support for partially resident textures to flags controlling how caching and dynamic color space compression are handled.

So the mappings contain tons of device specific information and it's most likely not even possible to handle all of this with a device independent mmap() call.
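
To make the remapping-table idea concrete, here is a minimal sketch as a toy userspace model. It assumes a tiny reserved range and invented per-page flags; every name (toy_gpu_vm, toy_pte, the TOY_PTE_* flags and the helpers) is hypothetical and is not taken from any real driver.

```c
/* Toy userspace model of a GPU VA range used as a per-page remapping table.
 * Not driver code; every name and flag here is hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_NR_PAGES   16  /* tiny stand-in for a large reserved range */

/* Invented device-specific flags that a CPU PTE has no equivalent for. */
enum {
    TOY_PTE_VALID      = 1u << 0,
    TOY_PTE_PRT        = 1u << 1,  /* partially resident texture page */
    TOY_PTE_UNCACHED   = 1u << 2,
    TOY_PTE_COMPRESSED = 1u << 3,
};

struct toy_pte {
    uint32_t bo_id;    /* which buffer object backs this page */
    uint32_t bo_page;  /* page offset inside that buffer object */
    uint32_t flags;
};

struct toy_gpu_vm {
    uint64_t base;                     /* start of the reserved VA range */
    struct toy_pte pte[TOY_NR_PAGES];  /* one entry per page, no big VMAs */
};

/* Page index of va inside the reserved range, or -1 if it is outside. */
static long toy_vm_index(const struct toy_gpu_vm *vm, uint64_t va)
{
    if (va < vm->base)
        return -1;
    uint64_t idx = (va - vm->base) >> TOY_PAGE_SHIFT;
    return idx < TOY_NR_PAGES ? (long)idx : -1;
}

/* Point one page of the range at one page of a buffer object. */
int toy_vm_map_page(struct toy_gpu_vm *vm, uint64_t va,
                    uint32_t bo_id, uint32_t bo_page, uint32_t flags)
{
    assert(vm);
    long idx = toy_vm_index(vm, va);
    if (idx < 0 || (va & ((1u << TOY_PAGE_SHIFT) - 1)))
        return -1;  /* out of range or not page aligned */
    vm->pte[idx] = (struct toy_pte){ bo_id, bo_page, flags | TOY_PTE_VALID };
    return 0;
}

int toy_vm_unmap_page(struct toy_gpu_vm *vm, uint64_t va)
{
    assert(vm);
    long idx = toy_vm_index(vm, va);
    if (idx < 0)
        return -1;
    vm->pte[idx] = (struct toy_pte){ 0, 0, 0 };
    return 0;
}

/* Returns the entry covering va, or NULL if that page is not mapped. */
const struct toy_pte *toy_vm_lookup(const struct toy_gpu_vm *vm, uint64_t va)
{
    assert(vm);
    long idx = toy_vm_index(vm, va);
    if (idx < 0 || !(vm->pte[idx].flags & TOY_PTE_VALID))
        return NULL;
    return &vm->pte[idx];
}
```

Adjacent pages in this model can point at unrelated buffer objects with different flags, and the map and unmap calls are cheap enough to be issued constantly. That is the usage pattern described above, and the per-page flags are the part that a device-independent mmap() call has no obvious place for.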

> 4. "The argument that a shared memory management leads to fewer bugs has also absolutely not been proven true. Instead we literally spent months if not years hunting down bugs which resulted from interaction between CPU and devices."
> This is another case supporting GMEM. Don't developers want to let GMEM handle the CPU-device interaction so that they can waive months of debugging cost?

No, we already have HMM for that.

Regards,
Christian.

>
> PS, hmadvise() is based on the idea of Nvidia's cudaMemAdvise() which provides abundant and useful memory policies. HMM extended mbind() instead.
>
> -Weixi
>
> -----Original Message-----
> From: Christian König <christian.koenig@xxxxxxx>
> Sent: Wednesday, November 29, 2023 11:22 PM
> To: zhuweixi <weixi.zhu@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxx>
> Cc: linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; 
> akpm@xxxxxxxxxxxxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx; mgorman@xxxxxxx; 
> jglisse@xxxxxxxxxx; rcampbell@xxxxxxxxxx; jhubbard@xxxxxxxxxx; 
> apopple@xxxxxxxxxx; mhairgrove@xxxxxxxxxx; ziy@xxxxxxxxxx; 
> alexander.deucher@xxxxxxx; Xinhui.Pan@xxxxxxx; 
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Felix.Kuehling@xxxxxxx; 
> ogabbay@xxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; jgg@xxxxxxxxxx; 
> leonro@xxxxxxxxxx; zhenyuw@xxxxxxxxxxxxxxx; zhi.a.wang@xxxxxxxxx; 
> intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; 
> jani.nikula@xxxxxxxxxxxxxxx; joonas.lahtinen@xxxxxxxxxxxxxxx; 
> rodrigo.vivi@xxxxxxxxx; tvrtko.ursulin@xxxxxxxxxxxxxxx
> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory 
> management) for external memory devices
>
> Am 29.11.23 um 09:27 schrieb zhuweixi:
>> Glad to hear that more sharable code is desirable.
>> IMHO, for a common MM subsystem, it is more beneficial for GMEM to 
>> extend core MM instead of building a separate one.
>>
>> As stated in the beginning of my RFC letter, MM systems are large and 
>> similar. Even a sophisticated one like Linux MM that has evolved over 
>> decades still suffers from an increasing number of bugs[1]. So, 
>> directly extending core MM to support devices not only avoids opening 
>> a new box of bugs, but also allows the community to concentrate on 
>> maintaining one single MM system. On the other side, GMEM does no
>> harm to core MM if a CPU process has no device contexts attached.
>>
>> @Christian, could you provide more information on what AMD proposed 
>> with KFD and why it was rejected?
> Well, this is going to be a longer explanation.
>
> The combination of KFD and HMM is based essentially on the same idea as this code here. Even the initial KFD implementation was very similar in the sense that it added device contexts to mm_struct and tried to manage GPU/acceleration MM the same way as CPU MM. In other words it was basically identical to your gm_dev_create() and gm_mmu approach.
>
> As mentioned before this initial proposal was rejected, for more background see the discussion around "amdkfd: Add amdkfd skeleton driver" on the dri-devel mailing list between 2013 and 2014. You need to dig up the whole discussion from the mailing list, but summarizing it the general feeling was that it would be a mistake to tie device drivers too close to CPU memory management (and stable UAPI) without validating that this is really the right thing to do.
>
> So instead of the original implementation KFD has gone upstream with a much less invasive approach where device contexts are just looked up on demand for each mm_struct. Felix can probably provide some pointers to the implementation.
>
> On the initially supported hardware the KFD used the PCIe ATC feature to allow routing of memory accesses directly into the associated CPU process address space. Later on we switched to an MMU notifier/HMM based approach to give similar functionality to the userspace stack on top of it for devices which don't support ATC. The ATC path was just recently completely removed and we are now only using MMU notifiers/HMM.
>
> HMM tried to add functionality similar to what you propose with the mmap() flag and hmadvise() call. The hmadvise() extension actually looks so similar to the HMM proposal that I would expect that this is actually based on that code.
>
> All this turned out to have some major design issues.
>
> First of all you have a rather large group of use cases where you don't want your device to mirror the address space of your process. Just think of things like QEMU, KVM, XEN, in general virtualization and container handling. Linux has the mantra that everything is a file and if it's not a file it should be a file and when you tie device memory management into CPU memory management you are pretty much violating exactly that.
>
> Second this doesn't integrate well with the filesystem layer in Linux.
> For example we do have struct pages for HMM exposed device memory, but 
> for I/O we still migrate this back to system memory because of (for
> example) the page lock requirements around writeback.
>
> Then third it turned out that the requirements to CPU address space management and device address space management are just massively different. For example huge and giant pages are a must have for modern devices, on the CPU side we are barely switching over to folios now to add similar functionality.
>
> The argument that a shared memory management leads to fewer bugs has also absolutely not been proven true. Instead we literally spent months if not years hunting down bugs which resulted from interaction between CPU and devices.
> ...
>
> There are a couple more things on the contra side of that approach, but I think that would just make this mail unnecessarily long.
>
> To sum it up from over a decade of experience working in this area I can just say that CPU and device memory management should absolutely *NOT* be mixed. We had those ideas multiple times before, but they either failed because they didn't integrate well with the core OS or because the hardware support was just lagging behind the actual requirements.
>
> What can be done and where I completely agree with Dave is that having common components which provide device drivers with the necessary functionality to manage their device address space is a really good idea.
> Danilo is for example working on a GPUVM component to have common virtual address space management and I'm at least sometimes working on MMU notifier/HMM improvements.
>
> Providing SVM functionality to your userspace stack is still a really good idea, but it should be done with MMU notifiers and components which are separate to your CPU memory management instead of tying it directly to the CPU address space.
>
> Regards,
> Christian.
>
>> [1] Huang, Jian, Moinuddin K. Qureshi, and Karsten Schwan. "An evolutionary study of linux memory management for fun and profit." 2016 USENIX Annual Technical Conference (USENIX ATC 16). 2016.
>>
>> Thanks,
>> Weixi
>>
>> -----Original Message-----
>> From: Dave Airlie <airlied@xxxxxxxxx>
>> Sent: Wednesday, November 29, 2023 1:15 PM
>> To: Christian König <christian.koenig@xxxxxxx>
>> Cc: zhuweixi <weixi.zhu@xxxxxxxxxx>; linux-mm@xxxxxxxxx; 
>> linux-kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; 
>> weixi.zhu@xxxxxxxxxxxx; mgorman@xxxxxxx; jglisse@xxxxxxxxxx; 
>> rcampbell@xxxxxxxxxx; jhubbard@xxxxxxxxxx; apopple@xxxxxxxxxx; 
>> mhairgrove@xxxxxxxxxx; ziy@xxxxxxxxxx; alexander.deucher@xxxxxxx; 
>> Xinhui.Pan@xxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; 
>> Felix.Kuehling@xxxxxxx; ogabbay@xxxxxxxxxx; 
>> dri-devel@xxxxxxxxxxxxxxxxxxxxx; jgg@xxxxxxxxxx; leonro@xxxxxxxxxx; 
>> zhenyuw@xxxxxxxxxxxxxxx; zhi.a.wang@xxxxxxxxx; 
>> intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; 
>> jani.nikula@xxxxxxxxxxxxxxx; joonas.lahtinen@xxxxxxxxxxxxxxx; 
>> rodrigo.vivi@xxxxxxxxx; tvrtko.ursulin@xxxxxxxxxxxxxxx
>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>>
>> On Tue, 28 Nov 2023 at 23:07, Christian König <christian.koenig@xxxxxxx> wrote:
>>> Am 28.11.23 um 13:50 schrieb Weixi Zhu:
>>>> The problem:
>>>>
>>>> Accelerator driver developers are forced to reinvent external MM 
>>>> subsystems case by case, because Linux core MM only considers host memory resources.
>>>> These reinvented MM subsystems have similar orders of magnitude of 
>>>> LoC as Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and 
>>>> Huawei NPU has 30K. Meanwhile, more and more vendors are 
>>>> implementing their own accelerators, e.g. Microsoft's Maia 100. At 
>>>> the same time, application-level developers suffer from poor 
>>>> programmability -- they must consider parallel address spaces and 
>>>> be careful about the limited device DRAM capacity. This can be 
>>>> alleviated if a malloc()-ed virtual address can be shared by the 
>>>> accelerator, or the abundant host DRAM can further transparently back up the device local memory.
>>>>
>>>> These external MM systems share similar mechanisms except for the 
>>>> hardware-dependent part, so reinventing them is effectively 
>>>> introducing redundant code (14K~70K for each case). Such 
>>>> developing/maintaining is not cheap. Furthermore, to share a 
>>>> malloc()-ed virtual address, device drivers need to deeply interact 
>>>> with Linux MM via low-level MM APIs, e.g. MMU notifiers/HMM. This 
>>>> raises the bar for driver development, since developers must 
>>>> understand how Linux MM works. Further, it creates code maintenance 
>>>> problems -- any changes to Linux MM potentially require coordinated changes to accelerator drivers using low-level MM APIs.
>>>>
>>>> Putting a cache-coherent bus between host and device will not make 
>>>> these external MM subsystems disappear. For example, a 
>>>> throughput-oriented accelerator will not tolerate executing heavy 
>>>> memory access workload with a host MMU/IOMMU via a remote bus.
>>>> Therefore, devices will still have their own MMU and pick a simpler 
>>>> page table format for lower address translation overhead, requiring external MM subsystems.
>>>>
>>>> --------------------
>>>>
>>>> What GMEM (Generalized Memory Management [1]) does:
>>>>
>>>> GMEM extends Linux MM to share its machine-independent MM code. 
>>>> Only high-level interface is provided for device drivers. This 
>>>> prevents accelerator drivers from reinventing the wheel, but relies 
>>>> on drivers to implement their hardware-dependent functions declared 
>>>> by GMEM. GMEM's key interfaces include gm_dev_create(),
>>>> gm_as_create(), gm_as_attach() and gm_dev_register_physmem(). Here is
>>>> a brief description of how a device driver utilizes them:
>>>> 1. At boot time, call gm_dev_create() and register the implementation of
>>>>       hardware-dependent functions as declared in struct gm_mmu.
>>>>         - If the device has local DRAM, call gm_dev_register_physmem() to
>>>>           register available physical addresses.
>>>> 2. When a device context is initialized (e.g. triggered by ioctl), check if
>>>>       the current CPU process has been attached to a gmem address space
>>>>       (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
>>>>       to it.
>>>> 3. Call gm_as_attach() to attach the device context to a gmem address space.
>>>> 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
>>>>       device computation happens.
>>>>
>>>> GMEM has changed the following assumptions in Linux MM:
>>>>      1. An mm_struct not only handles a single CPU context, but may also handle
>>>>         external memory contexts encapsulated as gm_context listed in
>>>>         mm->gm_as. An external memory context can include a few or all of the
>>>>         following parts: an external MMU (that requires TLB invalidation), an
>>>>         external page table (that requires PTE manipulation) and external DRAM
>>>>         (that requires physical memory management).
>>> Well that is pretty much exactly what AMD has already proposed with 
>>> KFD and was rejected for rather good reasons.
>>>> MMU functions
>>>> The MMU functions peer_map() and peer_unmap() overlap other 
>>>> functions, leaving a question if the MMU functions should be 
>>>> decoupled as more basic operations. Decoupling them could 
>>>> potentially prevent device drivers coalescing these basic steps 
>>>> within a single host-device communication operation, while coupling 
>>>> them makes it more difficult for device drivers to utilize GMEM interface.
>>> Well to be honest all of this sounds like history to me. We have 
>>> already seen the same basic approach in KFD, HMM and to some extent in TTM as well.
>>>
>>> And all of them more or less failed. Why should this here be different?
>> Any info we have on why this has failed to work in the past would be 
>> useful to provide. This is one of those cases where we may not have 
>> documented the bad ideas to stop future developers from thinking they
>> are good.
>>
>> I do think we would want more common code in this area, but I would 
>> think we'd have it more on the driver infrastructure side, than in 
>> the core mm.
>>
>> Dave.