Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

Alistair Popple <apopple@xxxxxxxxxx> · Fri, 01 Dec 2023 17:11:50 +1100

"Zeng, Oak" <oak.zeng@xxxxxxxxx> writes:

> See inline comments
>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>> zhuweixi
>> Sent: Thursday, November 30, 2023 5:48 AM
>> To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Zeng, Oak
>> <oak.zeng@xxxxxxxxx>; Christian König <christian.koenig@xxxxxxx>; linux-
>> mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
>> Danilo Krummrich <dakr@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel
>> Vetter <daniel@xxxxxxxx>
>> Cc: tvrtko.ursulin@xxxxxxxxxxxxxxx; rcampbell@xxxxxxxxxx; apopple@xxxxxxxxxx;
>> ziy@xxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx; jhubbard@xxxxxxxxxx; intel-
>> gfx@xxxxxxxxxxxxxxxxxxxxx; mhairgrove@xxxxxxxxxx; Wang, Zhi A
>> <zhi.a.wang@xxxxxxxxx>; Xinhui.Pan@xxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
>> jglisse@xxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; jgg@xxxxxxxxxx; Vivi,
>> Rodrigo <rodrigo.vivi@xxxxxxxxx>; alexander.deucher@xxxxxxx;
>> Felix.Kuehling@xxxxxxx; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx;
>> ogabbay@xxxxxxxxxx; leonro@xxxxxxxxxx; mgorman@xxxxxxx
>> Subject: RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>> 
>> Glad to know that there is a common demand for a new syscall like hmadvise(). I
>> expect it would also be useful for homogeneous NUMA cases. Credits to
>> cudaMemAdvise() API which brought this idea to GMEM's design.
>> 
>> To answer @Oak's questions about GMEM vs. HMM,
>> 
>> Here is the major difference:
>>   GMEM's main target is to stop drivers from reinventing MM code, while
>> HMM/MMU notifiers provide a compatible struct page solution and a
>> coordination mechanism for existing device driver MMs that requires adding
>> extra code to interact with CPU MM.
>> 
>> A straightforward qualitative result for the main target: after integrating Huawei's
>> Ascend NPU driver with GMEM's interface, 30,000 lines of MM code were cut,
>> leaving <100 lines invoking GMEM interface and 3,700 lines implementing vendor-
>> specific functions. Some code from the 3,700 lines should be further moved to
>> GMEM as a generalized feature like device memory oversubscription, but not
>> included in this RFC patch yet.
>> 
>> A list of high-level differences:
>>   1. With HMM/MMU notifiers, drivers need to first implement a full MM
>> subsystem. With GMEM, drivers can reuse Linux's core MM.
>
> A full mm subsystem essentially has below functions:
>
> Physical memory management: neither your approach nor hmm-based
> solution provide device physical memory management. You mentioned you
> have a plan but at least for now driver need to mange device physical
> memory.
>
> Virtual address space management: both approach leverage linux core mm, vma for this.
>
> Data eviction, migration: with hmm, driver need to implement this. It
> is not clear whether gmem has this function. I guess even gmem has it,
> it might be slow cpu data copy, compared to modern gpu's fast data
> copy engine.
>
> Device page table update, va-pa mapping: I think it is driver's responsibility in both approach.
>
> So from the point of re-use core MM, I don't see big difference. Maybe
> you did it more elegantly. I think it is very possible with your
> approach driver can be simpler, less codes.
>
>> 
>>   2. HMM encodes device mapping information in the CPU arch-dependent PTEs,
>> while GMEM proposes an abstraction layer in vm_object. Since GMEM's
>> approach further decouples the arch-related stuff, drivers do not need to
>> implement separate code for X86/ARM and etc.
>
> I don't understand this...with hmm, when a virtual address range's
> backing store is in device memory, cpu pte is encoded to point to
> device memory. Device page table is also encoded to point to the same
> device memory location. But since device memory is not accessible to
> CPU (DEVICE_PRIVATE), so when cpu access this virtual address, there
> is a cpu page fault. Device mapping info is still in device page
> table, not in cpu ptes.
>
> I do not see with hmm why driver need to implement x86/arm
> code... driver only take cares of device page table. Hmm/core mm take
> care of cpu page table, right?

I see our replies have crossed, but that is my understanding as well.

>> 
>>   3. MMU notifiers register hooks at certain core MM events, while GMEM
>> declares basic functions and internally invokes them. GMEM requires less from
>> the driver side -- no need to understand what core MM behaves at certain MMU
>> events. GMEM also expects fewer bugs than MMU notifiers: implementing basic
>> operations with standard declarations vs. implementing whatever random device
>> MM logic in MMU notifiers.
>
> This seems true to me. I feel the mmu notifier thing, especially the
> synchronization/lock design (those sequence numbers, interacting with
> driver lock, and the mmap lock) are very complicated. I indeed spent
> time to understand the specification documented in hmm.rst...

No argument there, but I think that's something we could look at
providing an improved interface for. I don't think it needs a whole new
subsystem to fix. Probably just a version of hmm_range_fault() that
takes the lock and sets up a MMU notifier itself.

I do think there is value in getting notified when core MM programs new
PTEs though as it would avoid expensive device faults. That's something
there is currently no way of doing.

> Your approach seems better.
>
>> 
>>   4. GMEM plans to support a more lightweight physical memory management.
>> The discussion about this part can be found in my cover letter. The question is
>> whether struct page should be compatible (directly use HMM's ZONE_DEVICE
>> solution) or a trimmed, smaller struct page that satisfies generalized demands
>> from accelerators is more preferrable?
>> 
>>   5. GMEM has been demonstrated to allow device memory oversubscription (a
>> GMEM-based 32GB NPU card can run a GPT model oversubscribing 500GB host
>> DDR), while drivers using HMM/MMU notifier must implement this logic one by
>> one. I will submit this part in a future RFC patch.
>
> When device memory is oversubscribed, do you call a driver callback
> function to evict device memory to system memory? Or just cpu copy?
> Copy with device's fast copy engine is faster.
>
> I can see even though with both approach we need to implement a driver
> copy function, with your approach, the driver logic can be
> simplified. With today's drm/ttm, I do see the logic in the memory
> eviction area is very complicated. Those eviction fence (some call it
> suspend fence), dma-fence enable signalling....very complicated to me.
>
> Essentially evict device memory to system memory is nothing different
> from evict system memory to disk... so if your approach can leverage
> some linux core mm eviction logic, I do see it can simplify things
> here...
>
>> 
>> I want to reiterate that GMEM's shared address space support is a bonus result,
>> not a main contribution... It was done because it was not difficult to implement
>> internal CPU-device coordination mechanism when core MM is extended by
>> GMEM to support devices.
>
> Besides memory eviction/oversubscription, there are a few other pain points when I use hmm:
>
> 1) hmm doesn't support file-back memory, so it is hard to share memory
> b/t process in a gpu environment. You mentioned you have a plan... How
> hard is it to support file-backed in your approach?
> 2)virtual address range based memory attribute/hint: with hmadvise,
> where do you save the memory attribute of a virtual address range? Do
> you need to extend vm_area_struct to save it? With hmm, we have to
> maintain such information at driver. This ends up with pretty
> complicated logic to split/merge those address range. I know core mm
> has similar logic to split/merge vma...
>
> Oak
>
>
>> 
>> -Weixi
>> 
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>> Sent: Thursday, November 30, 2023 4:28 PM
>> To: Zeng, Oak <oak.zeng@xxxxxxxxx>; Christian König
>> <christian.koenig@xxxxxxx>; zhuweixi <weixi.zhu@xxxxxxxxxx>; linux-
>> mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
>> Danilo Krummrich <dakr@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel
>> Vetter <daniel@xxxxxxxx>
>> Cc: intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; rcampbell@xxxxxxxxxx;
>> mhairgrove@xxxxxxxxxx; jgg@xxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx;
>> jhubbard@xxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; apopple@xxxxxxxxxx;
>> Xinhui.Pan@xxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
>> tvrtko.ursulin@xxxxxxxxxxxxxxx; ogabbay@xxxxxxxxxx; jglisse@xxxxxxxxxx; dri-
>> devel@xxxxxxxxxxxxxxxxxxxxx; ziy@xxxxxxxxxx; Vivi, Rodrigo
>> <rodrigo.vivi@xxxxxxxxx>; alexander.deucher@xxxxxxx; leonro@xxxxxxxxxx;
>> Felix.Kuehling@xxxxxxx; Wang, Zhi A <zhi.a.wang@xxxxxxxxx>;
>> mgorman@xxxxxxx
>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>> 
>> Hi Oak,
>> 
>> yeah, #4 is indeed a really good point and I think Felix will agree to that as well.
>> 
>> HMM is basically still missing a way to advise device attributes for the CPU
>> address space. Both migration strategy as well as device specific information (like
>> cache preferences) fall into this category.
>> 
>> Since there is a device specific component in those attributes as well I think
>> device specific IOCTLs still make sense to update them, but HMM should offer
>> the functionality to manage and store those information.
>> 
>> Split and merge of VMAs only become a problem if you attach those information
>> to VMAs, if you keep them completely separate than that doesn't become an
>> issue either. The down side of this approach is that you don't get automatically
>> extending attribute ranges for growing VMAs for example.
>> 
>> Regards,
>> Christian.
>> 
>> Am 29.11.23 um 23:23 schrieb Zeng, Oak:
>> > Hi Weixi,
>> >
>> > Even though Christian has listed reasons rejecting this proposal (yes they are
>> very reasonable to me), I would open my mind and further explore the possibility
>> here. Since the current GPU driver uses a hmm based implementation (AMD and
>> NV has done this; At Intel we are catching up), I want to explore how much we
>> can benefit from the proposed approach and how your approach can solve some
>> pain points of our development. So basically what I am questioning here is: what
>> is the advantage of your approach against hmm.
>> >
>> > To implement a UVM (unified virtual address space b/t cpu and gpu device),
>> with hmm, driver essentially need to implement below functions:
>> >
>> > 1. device page table update. Your approach requires the same because
>> > this is device specific codes
>> >
>> > 2. Some migration functions to migrate memory b/t system memory and GPU
>> local memory. My understanding is, even though you generalized this a bit, such
>> as modified cpu page fault path, provided "general" gm_dev_fault handler... but
>> device driver still need to provide migration functions because migration
>> functions have to be device specific (i.e., using device dma/copy engine for
>> performance purpose). Right?
>> >
>> > 3. GPU physical memory management, this part is now in drm/buddy, shared
>> by all drivers. I think with your approach, driver still need to provide callback
>> functions to allocate/free physical pages. Right? Or do you let linux core mm
>> buddy manage device memory directly?
>> >
>> > 4. madvise/hints/virtual address range management. This has been pain point
>> for us. Right now device driver has to maintain certain virtual address range data
>> structure to maintain hints and other virtual address range based memory
>> attributes. Driver need to sync with linux vma. Driver need to explicitly deal with
>> range split/merging... HMM doesn't provide support in this area. Your approach
>> seems cleaner/simpler to me...
>> >
>> >
>> > So in above, I have examined the some key factors of a gpu UVM memory
>> manager. I think for #1 and #2, hmm has provide pretty good abstraction/tools
>> for address space mirroring and migration helpers. For #3, since we have a
>> common drm/buddy layer, I don't think it is a big problem for driver writer now.
>> >
>> > I do see #4 is something you solved more beautifully, requires new system call
>> though.
>> >
>> > Oak
>> >
>> >
>> >> -----Original Message-----
>> >> From: dri-devel <dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf
>> >> Of Christian König
>> >> Sent: Tuesday, November 28, 2023 8:09 AM
>> >> To: Weixi Zhu <weixi.zhu@xxxxxxxxxx>; linux-mm@xxxxxxxxx; linux-
>> >> kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; Danilo Krummrich
>> >> <dakr@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter
>> >> <daniel@xxxxxxxx>
>> >> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; leonro@xxxxxxxxxx;
>> >> apopple@xxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; mgorman@xxxxxxx;
>> >> ziy@xxxxxxxxxx; Wang, Zhi A <zhi.a.wang@xxxxxxxxx>;
>> >> rcampbell@xxxxxxxxxx; jgg@xxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx;
>> >> jhubbard@xxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx;
>> >> mhairgrove@xxxxxxxxxx; jglisse@xxxxxxxxxx; Vivi, Rodrigo
>> >> <rodrigo.vivi@xxxxxxxxx>; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx;
>> >> tvrtko.ursulin@xxxxxxxxxxxxxxx; Felix.Kuehling@xxxxxxx;
>> >> Xinhui.Pan@xxxxxxx; alexander.deucher@xxxxxxx; ogabbay@xxxxxxxxxx
>> >> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> >> management) for external memory devices
>> >>
>> >> Adding a few missing important people to the explicit to list.
>> >>
>> >> Am 28.11.23 um 13:50 schrieb Weixi Zhu:
>> >>> The problem:
>> >>>
>> >>> Accelerator driver developers are forced to reinvent external MM
>> >>> subsystems case by case, because Linux core MM only considers host
>> memory resources.
>> >>> These reinvented MM subsystems have similar orders of magnitude of
>> >>> LoC as Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and
>> >>> Huawei NPU
>> >> has
>> >>> 30K. Meanwhile, more and more vendors are implementing their own
>> >>> accelerators, e.g. Microsoft's Maia 100. At the same time,
>> >>> application-level developers suffer from poor programmability --
>> >>> they must consider parallel address spaces and be careful about the
>> >>> limited device DRAM capacity. This can be alleviated if a
>> >>> malloc()-ed virtual address can be shared by the accelerator, or the
>> >>> abundant host DRAM can further transparently backup the device local
>> memory.
>> >>>
>> >>> These external MM systems share similar mechanisms except for the
>> >>> hardware-dependent part, so reinventing them is effectively
>> >>> introducing redundant code (14K~70K for each case). Such
>> >>> developing/maintaining is not cheap. Furthermore, to share a
>> >>> malloc()-ed virtual address, device drivers need to deeply interact
>> >>> with Linux MM via low-level MM APIs, e.g. MMU notifiers/HMM. This
>> >>> raises the bar for driver development, since developers must
>> >>> understand how Linux MM works. Further, it creates code maintenance
>> >>> problems -- any changes to Linux MM potentially require coordinated
>> changes to accelerator drivers using low-level MM APIs.
>> >>>
>> >>> Putting a cache-coherent bus between host and device will not make
>> >>> these external MM subsystems disappear. For example, a
>> >>> throughput-oriented accelerator will not tolerate executing heavy
>> >>> memory access workload with a host MMU/IOMMU via a remote bus.
>> >>> Therefore, devices will still have their own MMU and pick a simpler
>> >>> page table format for lower address translation overhead, requiring external
>> MM subsystems.
>> >>>
>> >>> --------------------
>> >>>
>> >>> What GMEM (Generalized Memory Management [1]) does:
>> >>>
>> >>> GMEM extends Linux MM to share its machine-independent MM code. Only
>> >>> high-level interface is provided for device drivers. This prevents
>> >>> accelerator drivers from reinventing the wheel, but relies on
>> >>> drivers to implement their hardware-dependent functions declared by
>> >>> GMEM. GMEM's
>> >> key
>> >>> interface include gm_dev_create(), gm_as_create(), gm_as_attach()
>> >>> and gm_dev_register_physmem(). Here briefly describe how a device
>> >>> driver utilizes them:
>> >>> 1. At boot time, call gm_dev_create() and registers the implementation of
>> >>>      hardware-dependent functions as declared in struct gm_mmu.
>> >>>        - If the device has local DRAM, call gm_dev_register_physmem() to
>> >>>          register available physical addresses.
>> >>> 2. When a device context is initialized (e.g. triggered by ioctl), check if
>> >>>      the current CPU process has been attached to a gmem address space
>> >>>      (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
>> >>>      to it.
>> >>> 3. Call gm_as_attach() to attach the device context to a gmem address space.
>> >>> 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
>> >>>      device computation happens.
>> >>>
>> >>> GMEM has changed the following assumptions in Linux MM:
>> >>>     1. An mm_struct not only handle a single CPU context, but may also handle
>> >>>        external memory contexts encapsulated as gm_context listed in
>> >>>        mm->gm_as. An external memory context can include a few or all of the
>> >>>        following parts: an external MMU (that requires TLB invalidation), an
>> >>>        external page table (that requires PTE manipulation) and external DRAM
>> >>>        (that requires physical memory management).
>> >>>     2. Faulting a MAP_PRIVATE VMA with no CPU PTE found does not
>> necessarily
>> >>>        mean that a zero-filled physical page should be mapped. The virtual
>> >>>        page may have been mapped to an external memory device.
>> >>>     3. Unmapping a page may include sending device TLB invalidation (even if
>> >>>        its MMU shares CPU page table) and manipulating device PTEs.
>> >>>
>> >>> --------------------
>> >>>
>> >>> Semantics of new syscalls:
>> >>>
>> >>> 1. mmap(..., MAP_PRIVATE | MAP_PEER_SHARED)
>> >>>       Allocate virtual address that is shared between the CPU and all
>> >>>       attached devices. Data is guaranteed to be coherent whenever the
>> >>>       address is accessed by either CPU or any attached device. If the device
>> >>>       does not support page fault, then device driver is responsible for
>> >>>       faulting memory before data gets accessed. By default, the CPU DRAM is
>> >>>       can be used as a swap backup for the device local memory.
>> >>> 2. hmadvise(NUMA_id, va_start, size, memory_hint)
>> >>>       Issuing memory hint for a given VMA. This extends traditional madvise()
>> >>>       syscall with an extra argument so that programmers have better control
>> >>>       with heterogeneous devices registered as NUMA nodes. One
>> >>> useful
>> >> memory
>> >>>       hint could be MADV_PREFETCH, which guarantees that the physical data
>> of
>> >>>       the given VMA [VA, VA+size) is migrated to NUMA node #id. Another
>> >>>       useful memory hint is MADV_DONTNEED. This is helpful to increase
>> device
>> >>>       memory utilization. It is worth considering extending the existing
>> >>>       madvise() syscall with one additional argument.
>> >>>
>> >>> --------------------
>> >>>
>> >>> Implementation details
>> >>>
>> >>> 1. New VMA flag: MAP_PEER_SHARED
>> >>>
>> >>> This new flag helps isolate GMEM feature, so that common processes
>> >>> with no device attached does not need to maintain any logical page
>> >>> table. It can be deleted if the extra overhead from GMEM is acceptable.
>> >>>
>> >>> 2. MMU functions
>> >>> The device driver must implement the MMU functions declared in
>> >>> struct gm_mmu.
>> >>>
>> >>> VA functions: peer_va_alloc_fixed(), peer_va_free()
>> >>>
>> >>> They are used to negotiate a common available VMA between a host
>> >>> process and a device process at the mmap() time. This is because
>> >>> some accelerators like Intel Xeon Phi or Huawei's Ascend NPU have
>> >>> their acceleration tasks executed within a device CPU process
>> >>> context. Some accelerators may also choose a different format of
>> >>> virtual address space.
>> >>>
>> >>> PA functions: alloc_page(), free_page(), prepare_page()
>> >>>
>> >>> Alloc_page() and free_page() are used to allocate and free device
>> >>> physical pages. Prepare_page() is used to zero-fill or DMA the data
>> >>> of a physical page. These functions were removed from the submitted
>> >>> patch, since GMEM does not need to invoke them when testing Huawei's
>> >>> NPU accelerator. The
>> >> NPU
>> >>> accelerator has an OS running in the device that manages the device
>> >>> physical memory. However, even for such a device it is better for
>> >>> the host to directly manage device physical memory, which saves
>> >>> device HBM and avoids synchronizing management status between the host
>> and device.
>> >>>
>> >>> Page-table functions:
>> >>> pmap_create()/destroy()/enter()/release()/protect()
>> >>>
>> >>> They are used to create and destroy device page tables, install and
>> >>> uninstall page table entries and to change the protection of page
>> >>> table entries.
>> >>>
>> >>> TLB-invalidation functions: tlb_invl(), tlb_invl_coalesced()
>> >>>
>> >>> They are used to invalidate the TLB entries of a given range of VA
>> >>> or invalidate a given list of VMAs.
>> >>>
>> >>> Wrapper functions: peer_map() and peer_unmap()
>> >>>
>> >>> These two functions are used to create or destroy a device mapping
>> >>> which could include allocating physical memory and copying data.
>> >>> They effectively wraps the PA functions, Page-table functions and
>> >>> TLB-invalidation functions. Implementing these steps together allows
>> >>> devices to optimize the communication cost between host and device.
>> >>> However, it requires the device driver to correctly order these steps.
>> >>>
>> >>> 3. Tracking logical mappings:
>> >>>
>> >>> Each process starts maintaining an xarray in
>> >>> mm->vm_obj->logical_page_table at the first time a host process
>> >>> calls mmap(MAP_PRIVATE |
>> >> MAP_PEER_SHARED).
>> >>> When a virtual page gets touched, its mapping status is created and
>> >>> stored in struct gm_mapping. The logical page table is utilized to
>> >>> query the struct gm_mapping given a virtual address. GMEM extends
>> >>> Linux MM to
>> >> update
>> >>> and lookup these logical mappings. For example, in the patch set we
>> >>> modify the page fault path of to additionally check the logical
>> >>> mapping of MAP_PEER_SHARED VMAs and identify if a device page should
>> be migrated.
>> >>> Similarly, if the device driver wants to resolve a device page fault
>> >>> or prefetch data, the driver should call gm_dev_fault(). This
>> >>> function examines the mapping status and determines whether the
>> >>> device driver should migrate a CPU page to device or install a zero-filled
>> device page.
>> >>>
>> >>> The logical mapping abstraction enhances the extensibility of Linux
>> >>> core MM (a virtual page may be mapped to a device physical page
>> >>> without any CPU PTE installed). The current implementation is not
>> >>> complete, since it only focused on anonymous VMAs with
>> >>> MAP_PEER_SHARED flag. The future plan of logical page table is to
>> >>> provide a generic abstraction layer that support common anonymous
>> >>> memory (I am looking at you, transparent huge pages)
>> >> and
>> >>> file-backed memory.
>> >>>
>> >>> --------------------
>> >>>
>> >>> Use cases
>> >>>
>> >>> GMEM has been tested over Huawei's NPU (neural process unit) device
>> driver.
>> >>> The original NPU device driver has approximately 30,000 lines of
>> >>> code for memory management. On the contrary, the GMEM-based one has
>> >>> less than 30 lines of code calling GMEM API, with approximately
>> >>> 3,700 lines of code implementing the MMU functions. This effectively
>> >>> saves over 26,200 lines of MM code for one driver. Therefore,
>> >>> developers from accelerator vendors, including Nvidia, AMD, Intel
>> >>> and other companies are welcome to discuss if GMEM could be helpful.
>> >>>
>> >>> Using GMEM-based driver, it is possible to write a C-style
>> >>> accelerator code with malloc(), whose underlying mmap() syscall
>> >>> should include MAP_PEER_SHARED according to current GMEM
>> >>> implementation. Importantly,
>> >> GMEM
>> >>> guarantees a coherent view of memory between the host and all
>> >>> attached devices. This means that any data written by the CPU or any
>> >>> attached accelerator can be seen by the next memory load instruction
>> >>> issued by any attached accelerator or the CPU. Furthermore, the NPU
>> >>> device was able to oversubscribe memory by swapping memory to host
>> >>> DDR. Note that this
>> >> memory
>> >>> oversubscription mechanism can be universal if the physical memory
>> >>> management is provided by GMEM. Other potential use cases of GMEM
>> >>> could include the IOMMU driver, KVM and RDMA drivers, as long as the
>> >>> device needs to manage external memory resources like VMAs, MMUs or
>> local DRAMs.
>> >>>
>> >>> --------------------
>> >>>
>> >>> Discussion
>> >>>
>> >>> Physical memory management
>> >>> Most accelerators require the host OS to manage device DRAM. Even
>> >>> accelerators capable of running an OS inside the driver can benefit
>> >>> from it, since it helps avoid synchronizing management status
>> >>> between the host and device. In Linux OSS EU summit 2023, Hannes
>> >>> Reinecke from SUSE Labs suggested that people are concerned with the
>> >>> memory consumption of struct page (which considers all generic
>> >>> scenarios for the kernel). This leads to a possible solution that,
>> >>> instead of reusing Linux struct page and ZONE_DEVICE mechanism, GMEM
>> >>> can implement an isolated buddy allocator
>> >> for
>> >>> the device to instantiate and register. The isolation is useful
>> >>> because device DRAM physical address space is independent.
>> >>> Furthermore, the isolated buddy allocator can utilize a customized
>> >>> struct page that consumes less memory. It is worth discussing if
>> >>> accelerator vendors desire this solution.
>> >>>
>> >>> MMU functions
>> >>> The MMU functions peer_map() and peer_unmap() overlap other
>> >>> functions, leaving a question if the MMU functions should be
>> >>> decoupled as more basic operations. Decoupling them could
>> >>> potentially prevent device drivers coalescing these basic steps
>> >>> within a single host-device communication operation, while coupling
>> >>> them makes it more difficult for device drivers to utilize GMEM interface.
>> >>>
>> >>> The idea of GMEM was originated from Weixi's PhD study with Prof.
>> >>> Scott Rixner and Prof. Alan L. Cox at Rice University.
>> >>>
>> >>> [1] https://arxiv.org/abs/2310.12554.
>> >>>
>> >>> Weixi Zhu (6):
>> >>>     mm/gmem: add heterogeneous NUMA node
>> >>>     mm/gmem: add arch-independent abstraction to track address mapping
>> >>>       status
>> >>>     mm/gmem: add GMEM (Generalized Memory Management) interface for
>> >>>       external accelerators
>> >>>     mm/gmem: add new syscall hmadvise() to issue memory hints for
>> >>>       heterogeneous NUMA nodes
>> >>>     mm/gmem: resolve VMA conflicts for attached peer devices
>> >>>     mm/gmem: extending Linux core MM to support unified virtual address
>> >>>       space
>> >>>
>> >>>    arch/arm64/include/asm/unistd.h         |   2 +-
>> >>>    arch/arm64/include/asm/unistd32.h       |   2 +
>> >>>    drivers/base/node.c                     |   6 +
>> >>>    fs/proc/task_mmu.c                      |   3 +
>> >>>    include/linux/gmem.h                    | 368 ++++++++++++
>> >>>    include/linux/mm.h                      |   8 +
>> >>>    include/linux/mm_types.h                |   5 +
>> >>>    include/linux/nodemask.h                |  10 +
>> >>>    include/uapi/asm-generic/mman-common.h  |   4 +
>> >>>    include/uapi/asm-generic/unistd.h       |   5 +-
>> >>>    init/main.c                             |   2 +
>> >>>    kernel/fork.c                           |   5 +
>> >>>    kernel/sys_ni.c                         |   2 +
>> >>>    mm/Kconfig                              |  14 +
>> >>>    mm/Makefile                             |   1 +
>> >>>    mm/gmem.c                               | 746 ++++++++++++++++++++++++
>> >>>    mm/huge_memory.c                        |  85 ++-
>> >>>    mm/memory.c                             |  42 +-
>> >>>    mm/mempolicy.c                          |   4 +
>> >>>    mm/mmap.c                               |  40 +-
>> >>>    mm/oom_kill.c                           |   2 +
>> >>>    mm/page_alloc.c                         |   3 +
>> >>>    mm/vm_object.c                          | 309 ++++++++++
>> >>>    tools/include/uapi/asm-generic/unistd.h |   5 +-
>> >>>    24 files changed, 1654 insertions(+), 19 deletions(-)
>> >>>    create mode 100644 include/linux/gmem.h
>> >>>    create mode 100644 mm/gmem.c
>> >>>    create mode 100644 mm/vm_object.c
>> >>>