"Zeng, Oak" <oak.zeng@xxxxxxxxx> writes: > See inline comments > >> -----Original Message----- >> From: dri-devel <dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of >> zhuweixi >> Sent: Thursday, November 30, 2023 5:48 AM >> To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Zeng, Oak >> <oak.zeng@xxxxxxxxx>; Christian König <christian.koenig@xxxxxxx>; linux- >> mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; >> Danilo Krummrich <dakr@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel >> Vetter <daniel@xxxxxxxx> >> Cc: tvrtko.ursulin@xxxxxxxxxxxxxxx; rcampbell@xxxxxxxxxx; apopple@xxxxxxxxxx; >> ziy@xxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx; jhubbard@xxxxxxxxxx; intel- >> gfx@xxxxxxxxxxxxxxxxxxxxx; mhairgrove@xxxxxxxxxx; Wang, Zhi A >> <zhi.a.wang@xxxxxxxxx>; Xinhui.Pan@xxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; >> jglisse@xxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; jgg@xxxxxxxxxx; Vivi, >> Rodrigo <rodrigo.vivi@xxxxxxxxx>; alexander.deucher@xxxxxxx; >> Felix.Kuehling@xxxxxxx; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; >> ogabbay@xxxxxxxxxx; leonro@xxxxxxxxxx; mgorman@xxxxxxx >> Subject: RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory >> management) for external memory devices >> >> Glad to know that there is a common demand for a new syscall like hmadvise(). I >> expect it would also be useful for homogeneous NUMA cases. Credits to >> cudaMemAdvise() API which brought this idea to GMEM's design. >> >> To answer @Oak's questions about GMEM vs. HMM, >> >> Here is the major difference: >> GMEM's main target is to stop drivers from reinventing MM code, while >> HMM/MMU notifiers provide a compatible struct page solution and a >> coordination mechanism for existing device driver MMs that requires adding >> extra code to interact with CPU MM. >> >> A straightforward qualitative result for the main target: after integrating Huawei's >> Ascend NPU driver with GMEM's interface, 30,000 lines of MM code were cut, >> leaving <100 lines invoking GMEM interface and 3,700 lines implementing vendor- >> specific functions. Some code from the 3,700 lines should be further moved to >> GMEM as a generalized feature like device memory oversubscription, but not >> included in this RFC patch yet. >> >> A list of high-level differences: >> 1. With HMM/MMU notifiers, drivers need to first implement a full MM >> subsystem. With GMEM, drivers can reuse Linux's core MM. > > A full mm subsystem essentially has below functions: > > Physical memory management: neither your approach nor hmm-based > solution provide device physical memory management. You mentioned you > have a plan but at least for now driver need to mange device physical > memory. > > Virtual address space management: both approach leverage linux core mm, vma for this. > > Data eviction, migration: with hmm, driver need to implement this. It > is not clear whether gmem has this function. I guess even gmem has it, > it might be slow cpu data copy, compared to modern gpu's fast data > copy engine. > > Device page table update, va-pa mapping: I think it is driver's responsibility in both approach. > > So from the point of re-use core MM, I don't see big difference. Maybe > you did it more elegantly. I think it is very possible with your > approach driver can be simpler, less codes. > >> >> 2. HMM encodes device mapping information in the CPU arch-dependent PTEs, >> while GMEM proposes an abstraction layer in vm_object. 
Since GMEM's >> approach further decouples the arch-related stuff, drivers do not need to >> implement separate code for X86/ARM and etc. > > I don't understand this...with hmm, when a virtual address range's > backing store is in device memory, cpu pte is encoded to point to > device memory. Device page table is also encoded to point to the same > device memory location. But since device memory is not accessible to > CPU (DEVICE_PRIVATE), so when cpu access this virtual address, there > is a cpu page fault. Device mapping info is still in device page > table, not in cpu ptes. > > I do not see with hmm why driver need to implement x86/arm > code... driver only take cares of device page table. Hmm/core mm take > care of cpu page table, right? I see our replies have crossed, but that is my understanding as well. >> >> 3. MMU notifiers register hooks at certain core MM events, while GMEM >> declares basic functions and internally invokes them. GMEM requires less from >> the driver side -- no need to understand what core MM behaves at certain MMU >> events. GMEM also expects fewer bugs than MMU notifiers: implementing basic >> operations with standard declarations vs. implementing whatever random device >> MM logic in MMU notifiers. > > This seems true to me. I feel the mmu notifier thing, especially the > synchronization/lock design (those sequence numbers, interacting with > driver lock, and the mmap lock) are very complicated. I indeed spent > time to understand the specification documented in hmm.rst... No argument there, but I think that's something we could look at providing an improved interface for. I don't think it needs a whole new subsystem to fix. Probably just a version of hmm_range_fault() that takes the lock and sets up a MMU notifier itself. I do think there is value in getting notified when core MM programs new PTEs though as it would avoid expensive device faults. That's something there is currently no way of doing. > Your approach seems better. > >> >> 4. GMEM plans to support a more lightweight physical memory management. >> The discussion about this part can be found in my cover letter. The question is >> whether struct page should be compatible (directly use HMM's ZONE_DEVICE >> solution) or a trimmed, smaller struct page that satisfies generalized demands >> from accelerators is more preferrable? >> >> 5. GMEM has been demonstrated to allow device memory oversubscription (a >> GMEM-based 32GB NPU card can run a GPT model oversubscribing 500GB host >> DDR), while drivers using HMM/MMU notifier must implement this logic one by >> one. I will submit this part in a future RFC patch. > > When device memory is oversubscribed, do you call a driver callback > function to evict device memory to system memory? Or just cpu copy? > Copy with device's fast copy engine is faster. > > I can see even though with both approach we need to implement a driver > copy function, with your approach, the driver logic can be > simplified. With today's drm/ttm, I do see the logic in the memory > eviction area is very complicated. Those eviction fence (some call it > suspend fence), dma-fence enable signalling....very complicated to me. > > Essentially evict device memory to system memory is nothing different > from evict system memory to disk... so if your approach can leverage > some linux core mm eviction logic, I do see it can simplify things > here... > >> >> I want to reiterate that GMEM's shared address space support is a bonus result, >> not a main contribution... 
It was done because it was not difficult to implement >> internal CPU-device coordination mechanism when core MM is extended by >> GMEM to support devices. > > Besides memory eviction/oversubscription, there are a few other pain points when I use hmm: > > 1) hmm doesn't support file-back memory, so it is hard to share memory > b/t process in a gpu environment. You mentioned you have a plan... How > hard is it to support file-backed in your approach? > 2)virtual address range based memory attribute/hint: with hmadvise, > where do you save the memory attribute of a virtual address range? Do > you need to extend vm_area_struct to save it? With hmm, we have to > maintain such information at driver. This ends up with pretty > complicated logic to split/merge those address range. I know core mm > has similar logic to split/merge vma... > > Oak > > >> >> -Weixi >> >> -----Original Message----- >> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> >> Sent: Thursday, November 30, 2023 4:28 PM >> To: Zeng, Oak <oak.zeng@xxxxxxxxx>; Christian König >> <christian.koenig@xxxxxxx>; zhuweixi <weixi.zhu@xxxxxxxxxx>; linux- >> mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; >> Danilo Krummrich <dakr@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel >> Vetter <daniel@xxxxxxxx> >> Cc: intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; rcampbell@xxxxxxxxxx; >> mhairgrove@xxxxxxxxxx; jgg@xxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx; >> jhubbard@xxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; apopple@xxxxxxxxxx; >> Xinhui.Pan@xxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; >> tvrtko.ursulin@xxxxxxxxxxxxxxx; ogabbay@xxxxxxxxxx; jglisse@xxxxxxxxxx; dri- >> devel@xxxxxxxxxxxxxxxxxxxxx; ziy@xxxxxxxxxx; Vivi, Rodrigo >> <rodrigo.vivi@xxxxxxxxx>; alexander.deucher@xxxxxxx; leonro@xxxxxxxxxx; >> Felix.Kuehling@xxxxxxx; Wang, Zhi A <zhi.a.wang@xxxxxxxxx>; >> mgorman@xxxxxxx >> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory >> management) for external memory devices >> >> Hi Oak, >> >> yeah, #4 is indeed a really good point and I think Felix will agree to that as well. >> >> HMM is basically still missing a way to advise device attributes for the CPU >> address space. Both migration strategy as well as device specific information (like >> cache preferences) fall into this category. >> >> Since there is a device specific component in those attributes as well I think >> device specific IOCTLs still make sense to update them, but HMM should offer >> the functionality to manage and store those information. >> >> Split and merge of VMAs only become a problem if you attach those information >> to VMAs, if you keep them completely separate than that doesn't become an >> issue either. The down side of this approach is that you don't get automatically >> extending attribute ranges for growing VMAs for example. >> >> Regards, >> Christian. >> >> Am 29.11.23 um 23:23 schrieb Zeng, Oak: >> > Hi Weixi, >> > >> > Even though Christian has listed reasons rejecting this proposal (yes they are >> very reasonable to me), I would open my mind and further explore the possibility >> here. Since the current GPU driver uses a hmm based implementation (AMD and >> NV has done this; At Intel we are catching up), I want to explore how much we >> can benefit from the proposed approach and how your approach can solve some >> pain points of our development. So basically what I am questioning here is: what >> is the advantage of your approach against hmm. 
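As an aside on the hint-storage question above (Oak's point 2 and Christian's suggestion to keep attribute ranges out of the VMA): a driver, or GMEM itself, could keep hmadvise() hints in its own range store keyed purely by address, so VMA split/merge never needs to be tracked. Below is a rough, hypothetical sketch using the kernel's generic interval tree -- the structure and function names are made up for illustration only and locking is omitted:

#include <linux/interval_tree.h>
#include <linux/slab.h>

/* Hypothetical per-process store for hmadvise() hints, kept outside of
 * vm_area_struct so that VMA split/merge never has to update it. */
struct hint_range {
	struct interval_tree_node it;	/* it.start/it.last cover the VA range */
	int nid;			/* target (device) NUMA node */
	int hint;			/* e.g. MADV_PREFETCH */
};

static struct rb_root_cached hint_ranges = RB_ROOT_CACHED;

static int hint_range_record(unsigned long start, unsigned long last,
			     int nid, int hint)
{
	struct hint_range *r = kzalloc(sizeof(*r), GFP_KERNEL);

	if (!r)
		return -ENOMEM;
	r->it.start = start;
	r->it.last = last;
	r->nid = nid;
	r->hint = hint;
	interval_tree_insert(&r->it, &hint_ranges);
	return 0;
}

/* At fault/migration time, look up the first hint overlapping [start, last]. */
static struct hint_range *hint_range_lookup(unsigned long start,
					    unsigned long last)
{
	struct interval_tree_node *it =
		interval_tree_iter_first(&hint_ranges, start, last);

	return it ? container_of(it, struct hint_range, it) : NULL;
}

The downside Christian mentions still applies to a store like this: a range recorded here does not automatically grow or shrink with the VMA it was issued against.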
>> > >> > To implement a UVM (unified virtual address space b/t cpu and gpu device), >> with hmm, driver essentially need to implement below functions: >> > >> > 1. device page table update. Your approach requires the same because >> > this is device specific codes >> > >> > 2. Some migration functions to migrate memory b/t system memory and GPU >> local memory. My understanding is, even though you generalized this a bit, such >> as modified cpu page fault path, provided "general" gm_dev_fault handler... but >> device driver still need to provide migration functions because migration >> functions have to be device specific (i.e., using device dma/copy engine for >> performance purpose). Right? >> > >> > 3. GPU physical memory management, this part is now in drm/buddy, shared >> by all drivers. I think with your approach, driver still need to provide callback >> functions to allocate/free physical pages. Right? Or do you let linux core mm >> buddy manage device memory directly? >> > >> > 4. madvise/hints/virtual address range management. This has been pain point >> for us. Right now device driver has to maintain certain virtual address range data >> structure to maintain hints and other virtual address range based memory >> attributes. Driver need to sync with linux vma. Driver need to explicitly deal with >> range split/merging... HMM doesn't provide support in this area. Your approach >> seems cleaner/simpler to me... >> > >> > >> > So in above, I have examined the some key factors of a gpu UVM memory >> manager. I think for #1 and #2, hmm has provide pretty good abstraction/tools >> for address space mirroring and migration helpers. For #3, since we have a >> common drm/buddy layer, I don't think it is a big problem for driver writer now. >> > >> > I do see #4 is something you solved more beautifully, requires new system call >> though. >> > >> > Oak >> > >> > >> >> -----Original Message----- >> >> From: dri-devel <dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf >> >> Of Christian König >> >> Sent: Tuesday, November 28, 2023 8:09 AM >> >> To: Weixi Zhu <weixi.zhu@xxxxxxxxxx>; linux-mm@xxxxxxxxx; linux- >> >> kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; Danilo Krummrich >> >> <dakr@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter >> >> <daniel@xxxxxxxx> >> >> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; leonro@xxxxxxxxxx; >> >> apopple@xxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; mgorman@xxxxxxx; >> >> ziy@xxxxxxxxxx; Wang, Zhi A <zhi.a.wang@xxxxxxxxx>; >> >> rcampbell@xxxxxxxxxx; jgg@xxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx; >> >> jhubbard@xxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; >> >> mhairgrove@xxxxxxxxxx; jglisse@xxxxxxxxxx; Vivi, Rodrigo >> >> <rodrigo.vivi@xxxxxxxxx>; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; >> >> tvrtko.ursulin@xxxxxxxxxxxxxxx; Felix.Kuehling@xxxxxxx; >> >> Xinhui.Pan@xxxxxxx; alexander.deucher@xxxxxxx; ogabbay@xxxxxxxxxx >> >> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory >> >> management) for external memory devices >> >> >> >> Adding a few missing important people to the explicit to list. >> >> >> >> Am 28.11.23 um 13:50 schrieb Weixi Zhu: >> >>> The problem: >> >>> >> >>> Accelerator driver developers are forced to reinvent external MM >> >>> subsystems case by case, because Linux core MM only considers host >> memory resources. >> >>> These reinvented MM subsystems have similar orders of magnitude of >> >>> LoC as Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and >> >>> Huawei NPU >> >> has >> >>> 30K. 
Meanwhile, more and more vendors are implementing their own >> >>> accelerators, e.g. Microsoft's Maia 100. At the same time, >> >>> application-level developers suffer from poor programmability -- >> >>> they must consider parallel address spaces and be careful about the >> >>> limited device DRAM capacity. This can be alleviated if a >> >>> malloc()-ed virtual address can be shared by the accelerator, or the >> >>> abundant host DRAM can further transparently backup the device local >> memory. >> >>> >> >>> These external MM systems share similar mechanisms except for the >> >>> hardware-dependent part, so reinventing them is effectively >> >>> introducing redundant code (14K~70K for each case). Such >> >>> developing/maintaining is not cheap. Furthermore, to share a >> >>> malloc()-ed virtual address, device drivers need to deeply interact >> >>> with Linux MM via low-level MM APIs, e.g. MMU notifiers/HMM. This >> >>> raises the bar for driver development, since developers must >> >>> understand how Linux MM works. Further, it creates code maintenance >> >>> problems -- any changes to Linux MM potentially require coordinated >> changes to accelerator drivers using low-level MM APIs. >> >>> >> >>> Putting a cache-coherent bus between host and device will not make >> >>> these external MM subsystems disappear. For example, a >> >>> throughput-oriented accelerator will not tolerate executing heavy >> >>> memory access workload with a host MMU/IOMMU via a remote bus. >> >>> Therefore, devices will still have their own MMU and pick a simpler >> >>> page table format for lower address translation overhead, requiring external >> MM subsystems. >> >>> >> >>> -------------------- >> >>> >> >>> What GMEM (Generalized Memory Management [1]) does: >> >>> >> >>> GMEM extends Linux MM to share its machine-independent MM code. Only >> >>> high-level interface is provided for device drivers. This prevents >> >>> accelerator drivers from reinventing the wheel, but relies on >> >>> drivers to implement their hardware-dependent functions declared by >> >>> GMEM. GMEM's >> >> key >> >>> interface include gm_dev_create(), gm_as_create(), gm_as_attach() >> >>> and gm_dev_register_physmem(). Here briefly describe how a device >> >>> driver utilizes them: >> >>> 1. At boot time, call gm_dev_create() and registers the implementation of >> >>> hardware-dependent functions as declared in struct gm_mmu. >> >>> - If the device has local DRAM, call gm_dev_register_physmem() to >> >>> register available physical addresses. >> >>> 2. When a device context is initialized (e.g. triggered by ioctl), check if >> >>> the current CPU process has been attached to a gmem address space >> >>> (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as >> >>> to it. >> >>> 3. Call gm_as_attach() to attach the device context to a gmem address space. >> >>> 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before >> >>> device computation happens. >> >>> >> >>> GMEM has changed the following assumptions in Linux MM: >> >>> 1. An mm_struct not only handle a single CPU context, but may also handle >> >>> external memory contexts encapsulated as gm_context listed in >> >>> mm->gm_as. An external memory context can include a few or all of the >> >>> following parts: an external MMU (that requires TLB invalidation), an >> >>> external page table (that requires PTE manipulation) and external DRAM >> >>> (that requires physical memory management). >> >>> 2. 
Faulting a MAP_PRIVATE VMA with no CPU PTE found does not necessarily
>> >>>    mean that a zero-filled physical page should be mapped. The virtual
>> >>>    page may have been mapped to an external memory device.
>> >>> 3. Unmapping a page may include sending device TLB invalidation (even if
>> >>>    its MMU shares the CPU page table) and manipulating device PTEs.
>> >>>
>> >>> --------------------
>> >>>
>> >>> Semantics of new syscalls:
>> >>>
>> >>> 1. mmap(..., MAP_PRIVATE | MAP_PEER_SHARED)
>> >>>    Allocates a virtual address range that is shared between the CPU and
>> >>>    all attached devices. Data is guaranteed to be coherent whenever the
>> >>>    address is accessed by either the CPU or any attached device. If the
>> >>>    device does not support page faults, then the device driver is
>> >>>    responsible for faulting memory before the data gets accessed. By
>> >>>    default, the CPU DRAM can be used as a swap backup for the device
>> >>>    local memory.
>> >>> 2. hmadvise(NUMA_id, va_start, size, memory_hint)
>> >>>    Issues a memory hint for a given VMA. This extends the traditional
>> >>>    madvise() syscall with an extra argument so that programmers have
>> >>>    better control over heterogeneous devices registered as NUMA nodes.
>> >>>    One useful memory hint could be MADV_PREFETCH, which guarantees that
>> >>>    the physical data of the given VMA [VA, VA+size) is migrated to NUMA
>> >>>    node #id. Another useful memory hint is MADV_DONTNEED. This is
>> >>>    helpful to increase device memory utilization. It is worth
>> >>>    considering extending the existing madvise() syscall with one
>> >>>    additional argument.
>> >>>
>> >>> --------------------
>> >>>
>> >>> Implementation details
>> >>>
>> >>> 1. New VMA flag: MAP_PEER_SHARED
>> >>>
>> >>> This new flag helps isolate the GMEM feature, so that common processes
>> >>> with no device attached do not need to maintain any logical page
>> >>> table. It can be deleted if the extra overhead from GMEM is acceptable.
>> >>>
>> >>> 2. MMU functions
>> >>> The device driver must implement the MMU functions declared in
>> >>> struct gm_mmu.
>> >>>
>> >>> VA functions: peer_va_alloc_fixed(), peer_va_free()
>> >>>
>> >>> They are used to negotiate a common available VMA between a host
>> >>> process and a device process at mmap() time. This is because
>> >>> some accelerators like Intel Xeon Phi or Huawei's Ascend NPU have
>> >>> their acceleration tasks executed within a device CPU process
>> >>> context. Some accelerators may also choose a different format of
>> >>> virtual address space.
>> >>>
>> >>> PA functions: alloc_page(), free_page(), prepare_page()
>> >>>
>> >>> Alloc_page() and free_page() are used to allocate and free device
>> >>> physical pages. Prepare_page() is used to zero-fill or DMA the data
>> >>> of a physical page. These functions were removed from the submitted
>> >>> patch, since GMEM does not need to invoke them when testing Huawei's
>> >>> NPU accelerator. The NPU accelerator has an OS running in the device
>> >>> that manages the device physical memory. However, even for such a
>> >>> device it is better for the host to directly manage device physical
>> >>> memory, which saves device HBM and avoids synchronizing management
>> >>> status between the host and device.
>> >>>
>> >>> Page-table functions:
>> >>> pmap_create()/destroy()/enter()/release()/protect()
>> >>>
>> >>> They are used to create and destroy device page tables, install and
>> >>> uninstall page table entries and to change the protection of page
>> >>> table entries.
>> >>>
>> >>> TLB-invalidation functions: tlb_invl(), tlb_invl_coalesced()
>> >>>
>> >>> They are used to invalidate the TLB entries of a given VA range
>> >>> or to invalidate a given list of VMAs.
>> >>>
>> >>> Wrapper functions: peer_map() and peer_unmap()
>> >>>
>> >>> These two functions are used to create or destroy a device mapping,
>> >>> which could include allocating physical memory and copying data.
>> >>> They effectively wrap the PA functions, page-table functions and
>> >>> TLB-invalidation functions. Implementing these steps together allows
>> >>> devices to optimize the communication cost between host and device.
>> >>> However, it requires the device driver to correctly order these steps.
>> >>>
>> >>> 3. Tracking logical mappings:
>> >>>
>> >>> Each process starts maintaining an xarray in
>> >>> mm->vm_obj->logical_page_table the first time a host process
>> >>> calls mmap(MAP_PRIVATE | MAP_PEER_SHARED).
>> >>> When a virtual page gets touched, its mapping status is created and
>> >>> stored in struct gm_mapping. The logical page table is utilized to
>> >>> query the struct gm_mapping given a virtual address. GMEM extends
>> >>> Linux MM to update and look up these logical mappings. For example,
>> >>> in the patch set we modify the page fault path to additionally check
>> >>> the logical mapping of MAP_PEER_SHARED VMAs and identify whether a
>> >>> device page should be migrated.
>> >>> Similarly, if the device driver wants to resolve a device page fault
>> >>> or prefetch data, the driver should call gm_dev_fault(). This
>> >>> function examines the mapping status and determines whether the
>> >>> device driver should migrate a CPU page to the device or install a
>> >>> zero-filled device page.
>> >>>
>> >>> The logical mapping abstraction enhances the extensibility of Linux
>> >>> core MM (a virtual page may be mapped to a device physical page
>> >>> without any CPU PTE installed). The current implementation is not
>> >>> complete, since it has only focused on anonymous VMAs with the
>> >>> MAP_PEER_SHARED flag. The future plan for the logical page table is
>> >>> to provide a generic abstraction layer that supports common anonymous
>> >>> memory (I am looking at you, transparent huge pages) and
>> >>> file-backed memory.
>> >>>
>> >>> --------------------
>> >>>
>> >>> Use cases
>> >>>
>> >>> GMEM has been tested with Huawei's NPU (neural processing unit) device
>> >>> driver. The original NPU device driver has approximately 30,000 lines
>> >>> of code for memory management. In contrast, the GMEM-based one has
>> >>> less than 30 lines of code calling the GMEM API, with approximately
>> >>> 3,700 lines of code implementing the MMU functions. This effectively
>> >>> saves over 26,200 lines of MM code for one driver. Therefore,
>> >>> developers from accelerator vendors, including Nvidia, AMD, Intel
>> >>> and other companies are welcome to discuss if GMEM could be helpful.
>> >>>
>> >>> Using a GMEM-based driver, it is possible to write C-style
>> >>> accelerator code with malloc(), whose underlying mmap() syscall
>> >>> should include MAP_PEER_SHARED according to the current GMEM
>> >>> implementation. Importantly, GMEM
>> >>> guarantees a coherent view of memory between the host and all
>> >>> attached devices. This means that any data written by the CPU or any
>> >>> attached accelerator can be seen by the next memory load instruction
>> >>> issued by any attached accelerator or the CPU.
Furthermore, the NPU >> >>> device was able to oversubscribe memory by swapping memory to host >> >>> DDR. Note that this >> >> memory >> >>> oversubscription mechanism can be universal if the physical memory >> >>> management is provided by GMEM. Other potential use cases of GMEM >> >>> could include the IOMMU driver, KVM and RDMA drivers, as long as the >> >>> device needs to manage external memory resources like VMAs, MMUs or >> local DRAMs. >> >>> >> >>> -------------------- >> >>> >> >>> Discussion >> >>> >> >>> Physical memory management >> >>> Most accelerators require the host OS to manage device DRAM. Even >> >>> accelerators capable of running an OS inside the driver can benefit >> >>> from it, since it helps avoid synchronizing management status >> >>> between the host and device. In Linux OSS EU summit 2023, Hannes >> >>> Reinecke from SUSE Labs suggested that people are concerned with the >> >>> memory consumption of struct page (which considers all generic >> >>> scenarios for the kernel). This leads to a possible solution that, >> >>> instead of reusing Linux struct page and ZONE_DEVICE mechanism, GMEM >> >>> can implement an isolated buddy allocator >> >> for >> >>> the device to instantiate and register. The isolation is useful >> >>> because device DRAM physical address space is independent. >> >>> Furthermore, the isolated buddy allocator can utilize a customized >> >>> struct page that consumes less memory. It is worth discussing if >> >>> accelerator vendors desire this solution. >> >>> >> >>> MMU functions >> >>> The MMU functions peer_map() and peer_unmap() overlap other >> >>> functions, leaving a question if the MMU functions should be >> >>> decoupled as more basic operations. Decoupling them could >> >>> potentially prevent device drivers coalescing these basic steps >> >>> within a single host-device communication operation, while coupling >> >>> them makes it more difficult for device drivers to utilize GMEM interface. >> >>> >> >>> The idea of GMEM was originated from Weixi's PhD study with Prof. >> >>> Scott Rixner and Prof. Alan L. Cox at Rice University. >> >>> >> >>> [1] https://arxiv.org/abs/2310.12554. 
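For readers skimming the cover letter above, here is roughly what the proposed userspace flow would look like. This is only a sketch under stated assumptions: it presumes the patched uapi headers export MAP_PEER_SHARED, MADV_PREFETCH and __NR_hmadvise (per the mman-common.h/unistd.h hunks in the diffstat quoted below), and that the device has been registered as NUMA node 1; none of this exists in mainline.

/* Sketch only: MAP_PEER_SHARED, MADV_PREFETCH and __NR_hmadvise are
 * assumed to come from the RFC's patched uapi headers. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static long hmadvise(int nid, void *va, size_t size, int hint)
{
	return syscall(__NR_hmadvise, nid, va, size, hint);
}

int main(void)
{
	size_t len = 1UL << 30;		/* 1 GiB shared between CPU and device */
	int device_nid = 1;		/* device registered as NUMA node 1 (example) */

	/* One allocation visible to the CPU and every attached accelerator. */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_PEER_SHARED, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 0, len);		/* CPU touches the range first */

	/* Ask GMEM to migrate the physical backing to the device node before
	 * the accelerator starts computing on it. */
	if (hmadvise(device_nid, buf, len, MADV_PREFETCH))
		perror("hmadvise");

	/* ... launch device work through the driver's usual ioctl path ... */

	munmap(buf, len);
	return 0;
}

The point of the example is that the allocation itself is ordinary anonymous mmap(); only the extra flag and the hint distinguish it from a host-only buffer.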
>> >>>
>> >>> Weixi Zhu (6):
>> >>>   mm/gmem: add heterogeneous NUMA node
>> >>>   mm/gmem: add arch-independent abstraction to track address mapping
>> >>>     status
>> >>>   mm/gmem: add GMEM (Generalized Memory Management) interface for
>> >>>     external accelerators
>> >>>   mm/gmem: add new syscall hmadvise() to issue memory hints for
>> >>>     heterogeneous NUMA nodes
>> >>>   mm/gmem: resolve VMA conflicts for attached peer devices
>> >>>   mm/gmem: extending Linux core MM to support unified virtual address
>> >>>     space
>> >>>
>> >>>  arch/arm64/include/asm/unistd.h         |   2 +-
>> >>>  arch/arm64/include/asm/unistd32.h       |   2 +
>> >>>  drivers/base/node.c                     |   6 +
>> >>>  fs/proc/task_mmu.c                      |   3 +
>> >>>  include/linux/gmem.h                    | 368 ++++++++++++
>> >>>  include/linux/mm.h                      |   8 +
>> >>>  include/linux/mm_types.h                |   5 +
>> >>>  include/linux/nodemask.h                |  10 +
>> >>>  include/uapi/asm-generic/mman-common.h  |   4 +
>> >>>  include/uapi/asm-generic/unistd.h       |   5 +-
>> >>>  init/main.c                             |   2 +
>> >>>  kernel/fork.c                           |   5 +
>> >>>  kernel/sys_ni.c                         |   2 +
>> >>>  mm/Kconfig                              |  14 +
>> >>>  mm/Makefile                             |   1 +
>> >>>  mm/gmem.c                               | 746 ++++++++++++++++++++++++
>> >>>  mm/huge_memory.c                        |  85 ++-
>> >>>  mm/memory.c                             |  42 +-
>> >>>  mm/mempolicy.c                          |   4 +
>> >>>  mm/mmap.c                               |  40 +-
>> >>>  mm/oom_kill.c                           |   2 +
>> >>>  mm/page_alloc.c                         |   3 +
>> >>>  mm/vm_object.c                          | 309 ++++++++++
>> >>>  tools/include/uapi/asm-generic/unistd.h |   5 +-
>> >>>  24 files changed, 1654 insertions(+), 19 deletions(-)
>> >>>  create mode 100644 include/linux/gmem.h
>> >>>  create mode 100644 mm/gmem.c
>> >>>  create mode 100644 mm/vm_object.c
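Finally, to make the driver-side flow described in the cover letter (gm_dev_create() at probe time, gm_dev_register_physmem() for local DRAM, gm_as_create()/gm_as_attach() at context init, gm_dev_fault() on device faults) a little more concrete, here is a skeleton. Only the function and structure names come from the RFC; the prototypes, the ops-table layout and everything prefixed my_ are guesses written purely for illustration -- the real declarations live in include/linux/gmem.h in patch 3 and will differ.

/* Illustrative skeleton only: all prototypes below are invented, not the RFC's. */
struct my_device {
	struct gm_dev *gm_dev;
	unsigned long vram_start, vram_end;
};

/* Step 1: at probe time, hand GMEM an ops table (struct gm_mmu) holding the
 * hardware-dependent callbacks and register any device-local DRAM. */
static struct gm_mmu my_mmu_ops = {
	/* .peer_map/.peer_unmap, .pmap_* and .tlb_invl* callbacks go here */
};

static int my_probe(struct my_device *mdev)
{
	mdev->gm_dev = gm_dev_create(&my_mmu_ops /* , capability flags */);
	if (IS_ERR(mdev->gm_dev))
		return PTR_ERR(mdev->gm_dev);
	/* Register local DRAM so GMEM can hand out device pages. */
	gm_dev_register_physmem(mdev->gm_dev, mdev->vram_start, mdev->vram_end);
	return 0;
}

/* Steps 2 and 3: when userspace opens a device context (e.g. via ioctl),
 * join the current process's GMEM address space, creating it on first use. */
static int my_ctx_init(struct my_device *mdev)
{
	if (!current->mm->gm_as)
		current->mm->gm_as = gm_as_create(/* range, policy, ... */);
	return gm_as_attach(current->mm->gm_as, mdev->gm_dev /* , ... */);
}

/* Step 4: on a device page fault (or ahead of a kernel launch), ask GMEM to
 * resolve the mapping; it decides between migrating a CPU page and installing
 * a zero-filled device page, and calls back into my_mmu_ops as needed. */
static int my_handle_fault(struct my_device *mdev, unsigned long va)
{
	return gm_dev_fault(current->mm, va, mdev->gm_dev /* , behavior */);
}

The intent of the skeleton is only to show how little device-independent logic remains in the driver under the proposed split; the prose above (steps 1-4 in the cover letter) is the authoritative description.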