Christian König <christian.koenig@xxxxxxx> writes:

> On 01.12.23 06:48, Zeng, Oak wrote:
>> [SNIP]
>> Besides memory eviction/oversubscription, there are a few other pain points when I use hmm:
>>
>> 1) hmm doesn't support file-backed memory, so it is hard to share memory b/t processes in a gpu environment. You mentioned you have a plan... How hard is it to support file-backed memory in your approach?
>
> As hard as it is to support it through HMM. That's what I meant by this approach not integrating well; as far as I know the problem isn't inside HMM or any other solution, but rather in the file system layer.

In what way does HMM not support file-backed memory? I was under the impression that at least hmm_range_fault() does.

 - Alistair

> Regards,
> Christian.
>
>> 2) virtual address range based memory attributes/hints: with hmadvise, where do you save the memory attributes of a virtual address range? Do you need to extend vm_area_struct to save them? With hmm, we have to maintain such information in the driver. This ends up with pretty complicated logic to split/merge those address ranges. I know core mm has similar logic to split/merge vmas...
>>
>> Oak
>>
>>> -Weixi
>>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>> Sent: Thursday, November 30, 2023 4:28 PM
>>> To: Zeng, Oak <oak.zeng@xxxxxxxxx>; Christian König <christian.koenig@xxxxxxx>; zhuweixi <weixi.zhu@xxxxxxxxxx>; linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; Danilo Krummrich <dakr@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>
>>> Cc: intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; rcampbell@xxxxxxxxxx; mhairgrove@xxxxxxxxxx; jgg@xxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx; jhubbard@xxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; apopple@xxxxxxxxxx; Xinhui.Pan@xxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; tvrtko.ursulin@xxxxxxxxxxxxxxx; ogabbay@xxxxxxxxxx; jglisse@xxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; ziy@xxxxxxxxxx; Vivi, Rodrigo <rodrigo.vivi@xxxxxxxxx>; alexander.deucher@xxxxxxx; leonro@xxxxxxxxxx; Felix.Kuehling@xxxxxxx; Wang, Zhi A <zhi.a.wang@xxxxxxxxx>; mgorman@xxxxxxx
>>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
>>>
>>> Hi Oak,
>>>
>>> yeah, #4 is indeed a really good point and I think Felix will agree to that as well.
>>>
>>> HMM is basically still missing a way to advise device attributes for the CPU address space. Both migration strategy as well as device specific information (like cache preferences) fall into this category.
>>>
>>> Since there is a device specific component in those attributes as well, I think device specific IOCTLs still make sense to update them, but HMM should offer the functionality to manage and store that information.
>>>
>>> Split and merge of VMAs only becomes a problem if you attach that information to VMAs; if you keep it completely separate then it doesn't become an issue either. The downside of this approach is that you don't automatically get extending attribute ranges for growing VMAs, for example.
>>>
>>> Regards,
>>> Christian.
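
To make the "keep them completely separate" suggestion concrete: a minimal sketch of a per-range attribute store that never touches vm_area_struct, so VMA split/merge cannot invalidate it. The maple tree is just one possible backing structure, and the struct, field and helper names below are invented for illustration; nothing here is taken from an existing patch.

/*
 * Illustrative sketch only: per-range hint storage kept outside the
 * VMA tree. All names below are hypothetical.
 */
#include <linux/maple_tree.h>
#include <linux/slab.h>
#include <linux/string.h>

struct hmadvise_attrs {                 /* hypothetical attribute payload */
        int preferred_nid;              /* e.g. device NUMA node to migrate to */
        unsigned int cache_hint;        /* device specific cache preference */
};

static DEFINE_MTREE(gmem_attr_tree);    /* independent of the VMA tree */

static int gmem_set_range_attrs(unsigned long start, unsigned long end,
                                const struct hmadvise_attrs *template)
{
        struct hmadvise_attrs *attrs;

        attrs = kmemdup(template, sizeof(*attrs), GFP_KERNEL);
        if (!attrs)
                return -ENOMEM;

        /* One entry covering [start, end - 1]; overlap handling omitted. */
        return mtree_store_range(&gmem_attr_tree, start, end - 1, attrs,
                                 GFP_KERNEL);
}

static struct hmadvise_attrs *gmem_lookup_attrs(unsigned long addr)
{
        /* Lookup is purely by address, however the covering VMA was split. */
        return mtree_load(&gmem_attr_tree, addr);
}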
>>>
>>> On 29.11.23 23:23, Zeng, Oak wrote:
>>>> Hi Weixi,
>>>>
>>>> Even though Christian has listed reasons for rejecting this proposal (yes, they are very reasonable to me), I would like to keep an open mind and further explore the possibility here. Since the current GPU drivers use an hmm based implementation (AMD and NV have done this; at Intel we are catching up), I want to explore how much we can benefit from the proposed approach and how your approach can solve some pain points of our development. So basically what I am questioning here is: what is the advantage of your approach over hmm?
>>>>
>>>> To implement a UVM (unified virtual address space b/t cpu and gpu device) with hmm, a driver essentially needs to implement the following:
>>>>
>>>> 1. Device page table updates. Your approach requires the same, because this is device specific code.
>>>>
>>>> 2. Some migration functions to migrate memory b/t system memory and GPU local memory. My understanding is, even though you generalized this a bit, such as modifying the cpu page fault path and providing a "general" gm_dev_fault handler... the device driver still needs to provide migration functions, because migration has to be device specific (i.e., using the device dma/copy engine for performance reasons). Right?
>>>>
>>>> 3. GPU physical memory management; this part is now in drm/buddy, shared by all drivers. I think with your approach, the driver still needs to provide callback functions to allocate/free physical pages. Right? Or do you let the linux core mm buddy allocator manage device memory directly?
>>>>
>>>> 4. madvise/hints/virtual address range management. This has been a pain point for us. Right now the device driver has to maintain its own virtual address range data structures to hold hints and other range based memory attributes. The driver needs to sync with linux vmas and explicitly deal with range split/merging... HMM doesn't provide support in this area. Your approach seems cleaner/simpler to me...
>>>>
>>>> So above I have examined some key factors of a gpu UVM memory manager. I think for #1 and #2, hmm provides pretty good abstractions/tools for address space mirroring and migration helpers. For #3, since we have a common drm/buddy layer, I don't think it is a big problem for driver writers now.
>>>>
>>>> I do see #4 as something you solved more beautifully, though it requires a new system call.
>>>>
>>>> Oak
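
For reference on the driver-side plumbing behind #1 and #2, here is a condensed sketch of the snapshot-and-retry mirroring pattern documented for hmm_range_fault() (Documentation/mm/hmm.rst). Notifier setup/teardown and the actual device page-table update are omitted; mirror_range() and its error handling are only a sketch, not code from any driver.

/* Sketch: mirror [start, end) of @mm into a device page table. */
#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mm.h>
#include <linux/slab.h>

static int mirror_range(struct mm_struct *mm,
                        struct mmu_interval_notifier *notifier,
                        unsigned long start, unsigned long end)
{
        unsigned long npages = (end - start) >> PAGE_SHIFT;
        unsigned long *pfns;
        struct hmm_range range = {
                .notifier = notifier,
                .start = start,
                .end = end,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        int ret;

        pfns = kvcalloc(npages, sizeof(*pfns), GFP_KERNEL);
        if (!pfns)
                return -ENOMEM;
        range.hmm_pfns = pfns;

retry:
        range.notifier_seq = mmu_interval_read_begin(notifier);

        mmap_read_lock(mm);
        ret = hmm_range_fault(&range);  /* faults CPU pages, fills hmm_pfns[] */
        mmap_read_unlock(mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto retry;
                goto out;
        }

        /* Take the driver's own page-table lock here, then revalidate. */
        if (mmu_interval_read_retry(notifier, range.notifier_seq))
                goto retry;     /* CPU mappings changed underneath us */

        /* Device specific: translate hmm_pfns[] and program the device PTEs. */
        ret = 0;
out:
        kvfree(pfns);
        return ret;
}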
>>>>
>>>>> -----Original Message-----
>>>>> From: dri-devel <dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Christian König
>>>>> Sent: Tuesday, November 28, 2023 8:09 AM
>>>>> To: Weixi Zhu <weixi.zhu@xxxxxxxxxx>; linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; Danilo Krummrich <dakr@xxxxxxxxxx>; Dave Airlie <airlied@xxxxxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>
>>>>> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; leonro@xxxxxxxxxx; apopple@xxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; mgorman@xxxxxxx; ziy@xxxxxxxxxx; Wang, Zhi A <zhi.a.wang@xxxxxxxxx>; rcampbell@xxxxxxxxxx; jgg@xxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx; jhubbard@xxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; mhairgrove@xxxxxxxxxx; jglisse@xxxxxxxxxx; Vivi, Rodrigo <rodrigo.vivi@xxxxxxxxx>; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; tvrtko.ursulin@xxxxxxxxxxxxxxx; Felix.Kuehling@xxxxxxx; Xinhui.Pan@xxxxxxx; alexander.deucher@xxxxxxx; ogabbay@xxxxxxxxxx
>>>>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
>>>>>
>>>>> Adding a few missing important people to the explicit to list.
>>>>>
>>>>> On 28.11.23 13:50, Weixi Zhu wrote:
>>>>>> The problem:
>>>>>>
>>>>>> Accelerator driver developers are forced to reinvent external MM subsystems case by case, because Linux core MM only considers host memory resources. These reinvented MM subsystems have a similar order of magnitude of LoC as Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and Huawei NPU has 30K. Meanwhile, more and more vendors are implementing their own accelerators, e.g. Microsoft's Maia 100. At the same time, application-level developers suffer from poor programmability -- they must consider parallel address spaces and be careful about the limited device DRAM capacity. This can be alleviated if a malloc()-ed virtual address can be shared by the accelerator, or the abundant host DRAM can further transparently back up the device local memory.
>>>>>>
>>>>>> These external MM systems share similar mechanisms except for the hardware-dependent part, so reinventing them is effectively introducing redundant code (14K~70K for each case). Such development/maintenance is not cheap. Furthermore, to share a malloc()-ed virtual address, device drivers need to deeply interact with Linux MM via low-level MM APIs, e.g. MMU notifiers/HMM. This raises the bar for driver development, since developers must understand how Linux MM works. Further, it creates code maintenance problems -- any changes to Linux MM potentially require coordinated changes to accelerator drivers using low-level MM APIs.
>>>>>>
>>>>>> Putting a cache-coherent bus between host and device will not make these external MM subsystems disappear. For example, a throughput-oriented accelerator will not tolerate executing heavy memory access workloads through a host MMU/IOMMU via a remote bus. Therefore, devices will still have their own MMU and pick a simpler page table format for lower address translation overhead, requiring external MM subsystems.
>>>>>>
>>>>>> --------------------
>>>>>>
>>>>>> What GMEM (Generalized Memory Management [1]) does:
>>>>>>
>>>>>> GMEM extends Linux MM to share its machine-independent MM code. Only a high-level interface is provided for device drivers. This prevents accelerator drivers from reinventing the wheel, but relies on drivers to implement the hardware-dependent functions declared by GMEM. GMEM's key interfaces include gm_dev_create(), gm_as_create(), gm_as_attach() and gm_dev_register_physmem(). Briefly, a device driver utilizes them as follows:
>>>>>>
>>>>>> 1. At boot time, call gm_dev_create() and register the implementation of the hardware-dependent functions declared in struct gm_mmu.
>>>>>>    - If the device has local DRAM, call gm_dev_register_physmem() to register the available physical addresses.
>>>>>> 2. When a device context is initialized (e.g. triggered by ioctl), check if the current CPU process has been attached to a gmem address space (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as to it.
>>>>>> 3. Call gm_as_attach() to attach the device context to a gmem address space.
>>>>>> 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before device computation happens.
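
The cover letter names these entry points but not their prototypes, so the following driver-side sketch of steps 1-4 is speculative: every argument list, the flag values and the npu_* helpers are assumptions made purely to show the intended call order.

/*
 * Speculative sketch of the call order in steps 1-4 above. The gm_*()
 * names come from the cover letter; every signature, flag and helper
 * here is assumed, not taken from the actual patches.
 */
#include <linux/gmem.h>         /* added by this series */
#include <linux/sched.h>
#include <linux/types.h>
#include <linux/err.h>

static struct gm_dev *npu_gm_dev;
extern struct gm_mmu npu_gm_mmu;        /* the hardware-dependent callbacks */

/* Step 1: at driver load, register the device and its local DRAM. */
static int npu_gmem_init(phys_addr_t dram_base, size_t dram_size)
{
        npu_gm_dev = gm_dev_create(&npu_gm_mmu, 0 /* assumed capability flags */);
        if (IS_ERR(npu_gm_dev))
                return PTR_ERR(npu_gm_dev);

        return gm_dev_register_physmem(npu_gm_dev, dram_base,
                                       dram_base + dram_size);
}

/* Steps 2 and 3: an ioctl creates a device context for the current process. */
static int npu_gmem_attach_current(void)
{
        struct gm_as *as = current->mm->gm_as;

        if (!as) {
                /* Assumed signature; the text only says to create it and
                 * point current->mm->gm_as at it. */
                as = gm_as_create(0, TASK_SIZE);
                if (IS_ERR(as))
                        return PTR_ERR(as);
                current->mm->gm_as = as;
        }

        return gm_as_attach(as, npu_gm_dev);
}

/* Step 4: resolve a device page fault (or prefetch) for a faulting VA. */
static int npu_gmem_handle_fault(unsigned long fault_va)
{
        return gm_dev_fault(current->mm, fault_va, npu_gm_dev, 0 /* assumed flags */);
}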
>>>>>>
>>>>>> GMEM has changed the following assumptions in Linux MM:
>>>>>> 1. An mm_struct not only handles a single CPU context, but may also handle external memory contexts encapsulated as gm_context listed in mm->gm_as. An external memory context can include a few or all of the following parts: an external MMU (that requires TLB invalidation), an external page table (that requires PTE manipulation) and external DRAM (that requires physical memory management).
>>>>>> 2. Faulting a MAP_PRIVATE VMA with no CPU PTE found does not necessarily mean that a zero-filled physical page should be mapped. The virtual page may have been mapped to an external memory device.
>>>>>> 3. Unmapping a page may include sending device TLB invalidations (even if its MMU shares the CPU page table) and manipulating device PTEs.
>>>>>>
>>>>>> --------------------
>>>>>>
>>>>>> Semantics of new syscalls:
>>>>>>
>>>>>> 1. mmap(..., MAP_PRIVATE | MAP_PEER_SHARED)
>>>>>>    Allocate a virtual address range that is shared between the CPU and all attached devices. Data is guaranteed to be coherent whenever the address is accessed by either the CPU or any attached device. If the device does not support page faults, then the device driver is responsible for faulting in memory before data gets accessed. By default, the CPU DRAM can be used as a swap backup for the device local memory.
>>>>>> 2. hmadvise(NUMA_id, va_start, size, memory_hint)
>>>>>>    Issue a memory hint for a given VMA. This extends the traditional madvise() syscall with an extra argument so that programmers have better control over heterogeneous devices registered as NUMA nodes. One useful memory hint could be MADV_PREFETCH, which guarantees that the physical data of the given VMA [VA, VA+size) is migrated to NUMA node #id. Another useful memory hint is MADV_DONTNEED. This is helpful to increase device memory utilization. It is worth considering extending the existing madvise() syscall with one additional argument.
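
From user space, the intended usage could look roughly like the example below. MAP_PEER_SHARED, MADV_PREFETCH and the hmadvise() syscall number exist only in this RFC's uapi headers, so the three constants here are placeholders rather than real values.

/*
 * Hypothetical user-space usage of the proposed interface. The three
 * constants are placeholders; real values would come from the uapi
 * headers added by this series, not from any released kernel.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAP_PEER_SHARED 0x8000000       /* placeholder value */
#define MADV_PREFETCH   100             /* placeholder value */
#define __NR_hmadvise   1000            /* placeholder syscall number */

static int hmadvise(int nid, void *addr, size_t len, int hint)
{
        return syscall(__NR_hmadvise, nid, addr, len, hint);
}

int main(void)
{
        size_t len = 64 << 20;          /* 64 MiB shared with all devices */
        int device_nid = 1;             /* the accelerator's NUMA node id */
        void *buf;

        buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_PEER_SHARED, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        memset(buf, 0, len);            /* CPU touches the data ... */

        /* ... then ask for the range to be migrated to the device's node. */
        if (hmadvise(device_nid, buf, len, MADV_PREFETCH))
                perror("hmadvise");

        /* launch_accelerator_kernel(buf, len); -- sees a coherent view */

        munmap(buf, len);
        return 0;
}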
>>>>>>
>>>>>> --------------------
>>>>>>
>>>>>> Implementation details
>>>>>>
>>>>>> 1. New VMA flag: MAP_PEER_SHARED
>>>>>>
>>>>>> This new flag helps isolate the GMEM feature, so that common processes with no device attached do not need to maintain any logical page table. It can be deleted if the extra overhead from GMEM is acceptable.
>>>>>>
>>>>>> 2. MMU functions
>>>>>>
>>>>>> The device driver must implement the MMU functions declared in struct gm_mmu.
>>>>>>
>>>>>> VA functions: peer_va_alloc_fixed(), peer_va_free()
>>>>>>
>>>>>> They are used to negotiate a common available VMA between a host process and a device process at mmap() time. This is because some accelerators like Intel Xeon Phi or Huawei's Ascend NPU have their acceleration tasks executed within a device CPU process context. Some accelerators may also choose a different format of virtual address space.
>>>>>>
>>>>>> PA functions: alloc_page(), free_page(), prepare_page()
>>>>>>
>>>>>> Alloc_page() and free_page() are used to allocate and free device physical pages. Prepare_page() is used to zero-fill or DMA the data of a physical page. These functions were removed from the submitted patch, since GMEM does not need to invoke them when testing Huawei's NPU accelerator. The NPU accelerator has an OS running in the device that manages the device physical memory. However, even for such a device it is better for the host to directly manage device physical memory, which saves device HBM and avoids synchronizing management status between the host and device.
>>>>>>
>>>>>> Page-table functions: pmap_create()/destroy()/enter()/release()/protect()
>>>>>>
>>>>>> They are used to create and destroy device page tables, install and uninstall page table entries, and change the protection of page table entries.
>>>>>>
>>>>>> TLB-invalidation functions: tlb_invl(), tlb_invl_coalesced()
>>>>>>
>>>>>> They are used to invalidate the TLB entries of a given range of VA or to invalidate a given list of VMAs.
>>>>>>
>>>>>> Wrapper functions: peer_map() and peer_unmap()
>>>>>>
>>>>>> These two functions are used to create or destroy a device mapping, which could include allocating physical memory and copying data. They effectively wrap the PA functions, page-table functions and TLB-invalidation functions. Implementing these steps together allows devices to optimize the communication cost between host and device. However, it requires the device driver to correctly order these steps.
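
Putting the callback names above together, a driver-facing ops table could plausibly look like the following sketch. The cover letter lists only the names, so the struct layout and every prototype are guesses for illustration (hence the _sketch suffix), not the real struct gm_mmu.

/*
 * Guessed layout, for illustration only, of the ops table a driver
 * would fill in. The real struct gm_mmu in the patches will differ.
 */
#include <linux/mm.h>
#include <linux/list.h>

struct gm_dev;

struct gm_mmu_sketch {
        /* VA functions: negotiate VA ranges with a device-side process. */
        int (*peer_va_alloc_fixed)(struct gm_dev *dev, unsigned long va,
                                   unsigned long size);
        int (*peer_va_free)(struct gm_dev *dev, unsigned long va,
                            unsigned long size);

        /* PA functions: device physical memory management. */
        unsigned long (*alloc_page)(struct gm_dev *dev);
        void (*free_page)(struct gm_dev *dev, unsigned long dev_pfn);
        int (*prepare_page)(struct gm_dev *dev, unsigned long dev_pfn,
                            struct page *src);  /* zero-fill or DMA */

        /* Page-table functions. */
        int (*pmap_create)(struct gm_dev *dev, void **pmap);
        void (*pmap_destroy)(void *pmap);
        int (*pmap_enter)(void *pmap, unsigned long va,
                          unsigned long dev_pfn, pgprot_t prot);
        int (*pmap_release)(void *pmap, unsigned long va, unsigned long size);
        int (*pmap_protect)(void *pmap, unsigned long va, unsigned long size,
                            pgprot_t prot);

        /* TLB-invalidation functions. */
        int (*tlb_invl)(struct gm_dev *dev, unsigned long va,
                        unsigned long size);
        int (*tlb_invl_coalesced)(struct gm_dev *dev, struct list_head *vmas);

        /* Wrappers that let the driver batch the steps above per mapping. */
        int (*peer_map)(struct gm_dev *dev, unsigned long va,
                        unsigned long size);
        int (*peer_unmap)(struct gm_dev *dev, unsigned long va,
                          unsigned long size);
};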
>>>>>>
>>>>>> 3. Tracking logical mappings:
>>>>>>
>>>>>> Each process starts maintaining an xarray in mm->vm_obj->logical_page_table the first time a host process calls mmap(MAP_PRIVATE | MAP_PEER_SHARED). When a virtual page gets touched, its mapping status is created and stored in struct gm_mapping. The logical page table is utilized to query the struct gm_mapping for a given virtual address. GMEM extends Linux MM to update and look up these logical mappings. For example, in the patch set we modify the page fault path to additionally check the logical mapping of MAP_PEER_SHARED VMAs and identify whether a device page should be migrated. Similarly, if the device driver wants to resolve a device page fault or prefetch data, the driver should call gm_dev_fault(). This function examines the mapping status and determines whether the device driver should migrate a CPU page to the device or install a zero-filled device page.
>>>>>>
>>>>>> The logical mapping abstraction enhances the extensibility of Linux core MM (a virtual page may be mapped to a device physical page without any CPU PTE installed). The current implementation is not complete, since it only focuses on anonymous VMAs with the MAP_PEER_SHARED flag. The future plan for the logical page table is to provide a generic abstraction layer that supports common anonymous memory (I am looking at you, transparent huge pages) and file-backed memory.
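
As a rough illustration of how such a lookup might work, the per-page status could live in an xarray indexed by virtual page number. struct gm_mapping is named in the cover letter, but its fields, flags and the helper below are invented for this sketch.

/*
 * Rough sketch of a logical page table in an xarray indexed by virtual
 * page number. Fields and flags of struct gm_mapping are assumed.
 */
#include <linux/xarray.h>
#include <linux/mm.h>
#include <linux/slab.h>

struct gm_mapping {                     /* assumed per-virtual-page status */
#define GM_PAGE_CPU     0x1             /* currently backed by host DRAM */
#define GM_PAGE_DEVICE  0x2             /* currently backed by device memory */
        unsigned int flags;
        struct page *page;              /* valid when GM_PAGE_CPU is set */
        unsigned long dev_pfn;          /* valid when GM_PAGE_DEVICE is set */
};

/* Find or create the mapping status of @addr in @logical_pt. */
static struct gm_mapping *gm_mapping_lookup(struct xarray *logical_pt,
                                            unsigned long addr)
{
        unsigned long index = addr >> PAGE_SHIFT;
        struct gm_mapping *gm;

        gm = xa_load(logical_pt, index);
        if (gm)
                return gm;

        gm = kzalloc(sizeof(*gm), GFP_KERNEL);
        if (!gm)
                return NULL;

        /* xa_store() returns the previous entry or an xa_err() pointer. */
        if (xa_err(xa_store(logical_pt, index, gm, GFP_KERNEL))) {
                kfree(gm);
                return NULL;
        }
        return gm;
}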
>>>>>>
>>>>>> --------------------
>>>>>>
>>>>>> Use cases
>>>>>>
>>>>>> GMEM has been tested over Huawei's NPU (neural processing unit) device driver. The original NPU device driver has approximately 30,000 lines of code for memory management. In contrast, the GMEM-based one has less than 30 lines of code calling the GMEM API, with approximately 3,700 lines of code implementing the MMU functions. This effectively saves over 26,200 lines of MM code for one driver. Therefore, developers from accelerator vendors, including Nvidia, AMD, Intel and other companies, are welcome to discuss if GMEM could be helpful.
>>>>>>
>>>>>> Using a GMEM-based driver, it is possible to write C-style accelerator code with malloc(), whose underlying mmap() syscall should include MAP_PEER_SHARED according to the current GMEM implementation. Importantly, GMEM guarantees a coherent view of memory between the host and all attached devices. This means that any data written by the CPU or any attached accelerator can be seen by the next memory load instruction issued by any attached accelerator or the CPU. Furthermore, the NPU device was able to oversubscribe memory by swapping memory to host DDR. Note that this memory oversubscription mechanism can be universal if the physical memory management is provided by GMEM. Other potential use cases of GMEM could include the IOMMU driver, KVM and RDMA drivers, as long as the device needs to manage external memory resources like VMAs, MMUs or local DRAM.
>>>>>>
>>>>>> --------------------
>>>>>>
>>>>>> Discussion
>>>>>>
>>>>>> Physical memory management
>>>>>> Most accelerators require the host OS to manage device DRAM. Even accelerators capable of running an OS inside the driver can benefit from it, since it helps avoid synchronizing management status between the host and device. At the Linux OSS EU summit 2023, Hannes Reinecke from SUSE Labs suggested that people are concerned with the memory consumption of struct page (which considers all generic scenarios for the kernel). This leads to a possible solution that, instead of reusing Linux struct page and the ZONE_DEVICE mechanism, GMEM can implement an isolated buddy allocator for the device to instantiate and register. The isolation is useful because the device DRAM physical address space is independent. Furthermore, the isolated buddy allocator can utilize a customized struct page that consumes less memory. It is worth discussing if accelerator vendors desire this solution.
>>>>>>
>>>>>> MMU functions
>>>>>> The MMU functions peer_map() and peer_unmap() overlap other functions, leaving the question of whether the MMU functions should be decoupled into more basic operations. Decoupling them could potentially prevent device drivers from coalescing these basic steps within a single host-device communication operation, while coupling them makes it more difficult for device drivers to utilize the GMEM interface.
>>>>>>
>>>>>> The idea of GMEM originated from Weixi's PhD study with Prof. Scott Rixner and Prof. Alan L. Cox at Rice University.
>>>>>>
>>>>>> [1] https://arxiv.org/abs/2310.12554
>>>>>>
>>>>>> Weixi Zhu (6):
>>>>>>   mm/gmem: add heterogeneous NUMA node
>>>>>>   mm/gmem: add arch-independent abstraction to track address mapping
>>>>>>     status
>>>>>>   mm/gmem: add GMEM (Generalized Memory Management) interface for
>>>>>>     external accelerators
>>>>>>   mm/gmem: add new syscall hmadvise() to issue memory hints for
>>>>>>     heterogeneous NUMA nodes
>>>>>>   mm/gmem: resolve VMA conflicts for attached peer devices
>>>>>>   mm/gmem: extending Linux core MM to support unified virtual address
>>>>>>     space
>>>>>>
>>>>>>  arch/arm64/include/asm/unistd.h         |   2 +-
>>>>>>  arch/arm64/include/asm/unistd32.h       |   2 +
>>>>>>  drivers/base/node.c                     |   6 +
>>>>>>  fs/proc/task_mmu.c                      |   3 +
>>>>>>  include/linux/gmem.h                    | 368 ++++++++++++
>>>>>>  include/linux/mm.h                      |   8 +
>>>>>>  include/linux/mm_types.h                |   5 +
>>>>>>  include/linux/nodemask.h                |  10 +
>>>>>>  include/uapi/asm-generic/mman-common.h  |   4 +
>>>>>>  include/uapi/asm-generic/unistd.h       |   5 +-
>>>>>>  init/main.c                             |   2 +
>>>>>>  kernel/fork.c                           |   5 +
>>>>>>  kernel/sys_ni.c                         |   2 +
>>>>>>  mm/Kconfig                              |  14 +
>>>>>>  mm/Makefile                             |   1 +
>>>>>>  mm/gmem.c                               | 746 ++++++++++++++++++++++++
>>>>>>  mm/huge_memory.c                        |  85 ++-
>>>>>>  mm/memory.c                             |  42 +-
>>>>>>  mm/mempolicy.c                          |   4 +
>>>>>>  mm/mmap.c                               |  40 +-
>>>>>>  mm/oom_kill.c                           |   2 +
>>>>>>  mm/page_alloc.c                         |   3 +
>>>>>>  mm/vm_object.c                          | 309 ++++++++++
>>>>>>  tools/include/uapi/asm-generic/unistd.h |   5 +-
>>>>>>  24 files changed, 1654 insertions(+), 19 deletions(-)
>>>>>>  create mode 100644 include/linux/gmem.h
>>>>>>  create mode 100644 mm/gmem.c
>>>>>>  create mode 100644 mm/vm_object.c
>>>>>>