On 03/22/2018 05:55 PM, jglisse@xxxxxxxxxx wrote:
> From: Ralph Campbell <rcampbell@xxxxxxxxxx>
> 
> This patch updates the documentation for HMM to fix minor typos and
> phrasing to be a bit more readable.
> 
> Signed-off-by: Ralph Campbell <rcampbell@xxxxxxxxxx>
> Signed-off-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
> Cc: Stephen Bates <sbates@xxxxxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
> Cc: Logan Gunthorpe <logang@xxxxxxxxxxxx>
> Cc: Evgeny Baskakov <ebaskakov@xxxxxxxxxx>
> Cc: Mark Hairgrove <mhairgrove@xxxxxxxxxx>
> Cc: John Hubbard <jhubbard@xxxxxxxxxx>
> ---
>  Documentation/vm/hmm.txt | 360 ++++++++++++++++++++++++-----------------------
>  MAINTAINERS              |   1 +
>  2 files changed, 187 insertions(+), 174 deletions(-)
> 
> diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
> index 4d3aac9f4a5d..e99b97003982 100644
> --- a/Documentation/vm/hmm.txt
> +++ b/Documentation/vm/hmm.txt
> @@ -1,151 +1,159 @@
>  Heterogeneous Memory Management (HMM)
>  
> -Transparently allow any component of a program to use any memory region of said
> -program with a device without using device specific memory allocator. This is
> -becoming a requirement to simplify the use of advance heterogeneous computing
> -where GPU, DSP or FPGA are use to perform various computations.
> -
> -This document is divided as follow, in the first section i expose the problems
> -related to the use of a device specific allocator. The second section i expose
> -the hardware limitations that are inherent to many platforms. The third section
> -gives an overview of HMM designs. The fourth section explains how CPU page-
> -table mirroring works and what is HMM purpose in this context. Fifth section
> -deals with how device memory is represented inside the kernel. Finaly the last
> -section present the new migration helper that allow to leverage the device DMA
> -engine.
> -
> -
> -1) Problems of using device specific memory allocator:
> -2) System bus, device memory characteristics
> -3) Share address space and migration
> +Provide infrastructure and helpers to integrate non conventional memory (device
                                                    non-conventional
> +memory like GPU on board memory) into regular kernel code path. Corner stone of
                                                                path, with the cornerstone of
> +this being specialize struct page for such memory (see sections 5 to 7 of this
              specialized
> +document).
> +
> +HMM also provide optional helpers for SVM (Share Virtual Memory) ie allowing a
          provides                                          Memory), i.e.,
> +device to transparently access program address coherently with the CPU meaning
> +that any valid pointer on the CPU is also a valid pointer for the device. This
> +is becoming a mandatory to simplify the use of advance heterogeneous computing
      becoming mandatory                          advanced
> +where GPU, DSP, or FPGA are used to perform various computations on behalf of
> +a process.
> +
> +This document is divided as follows: in the first section I expose the problems
> +related to using device specific memory allocators. In the second section, I
> +expose the hardware limitations that are inherent to many platforms. The third
> +section gives an overview of the HMM design. The fourth section explains how
> +CPU page-table mirroring works and what is HMM's purpose in this context. The
                                        and the purpose of HMM in this context.
> +fifth section deals with how device memory is represented inside the kernel.
> +Finally, the last section presents a new migration helper that allows lever-
> +aging the device DMA engine.
> +
> +
> +1) Problems of using a device specific memory allocator:
> +2) I/O bus, device memory characteristics
> +3) Shared address space and migration
>  4) Address space mirroring implementation and API
>  5) Represent and manage device memory from core kernel point of view
> -6) Migrate to and from device memory
> +6) Migration to and from device memory
>  7) Memory cgroup (memcg) and rss accounting
>  
>  
> -------------------------------------------------------------------------------
>  
> -1) Problems of using device specific memory allocator:
> +1) Problems of using a device specific memory allocator:
>  
> -Device with large amount of on board memory (several giga bytes) like GPU have
> -historically manage their memory through dedicated driver specific API. This
> -creates a disconnect between memory allocated and managed by device driver and
> -regular application memory (private anonymous, share memory or regular file
> -back memory). From here on i will refer to this aspect as split address space.
> -I use share address space to refer to the opposite situation ie one in which
> -any memory region can be use by device transparently.
> +Devices with a large amount of on board memory (several giga bytes) like GPUs
                                                           gigabytes)
> +have historically managed their memory through dedicated driver specific APIs.
> +This creates a disconnect between memory allocated and managed by a device
> +driver and regular application memory (private anonymous, shared memory, or
> +regular file backed memory). From here on I will refer to this aspect as split
> +address space. I use shared address space to refer to the opposite situation:
> +i.e., one in which any application memory region can be used by a device
> +transparently.
>  
>  Split address space because device can only access memory allocated through the
        Awkward sentence: maybe:  Split address space happens because
> -device specific API. This imply that all memory object in a program are not
> -equal from device point of view which complicate large program that rely on a
> -wide set of libraries.
> +device specific API. This implies that all memory objects in a program are not
> +equal from the device point of view which complicates large programs that rely
> +on a wide set of libraries.
>  
> -Concretly this means that code that wants to leverage device like GPU need to
> +Concretly this means that code that wants to leverage devices like GPUs need to
    Concretely                                                           needs
>  copy object between genericly allocated memory (malloc, mmap private/share/)
        object [or an object] between generically
>  and memory allocated through the device driver API (this still end up with an
                                                                  ends up
>  mmap but of the device file).
>  
> -For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
> -complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
> -data-set need to re-map all the pointer relations between each of its elements.
> -This is error prone and program gets harder to debug because of the duplicate
> -data-set.
> +For flat data-sets (array, grid, image, ...) this isn't too hard to achieve but
             data sets
> +complex data-sets (list, tree, ...) are hard to get right. Duplicating a
            data sets
> +complex data-set needs to re-map all the pointer relations between each of its
            data set
> +elements. This is error prone and program gets harder to debug because of the
> +duplicate data-set and addresses.
              data set
>  
> -Split address space also means that library can not transparently use data they
> -are getting from core program or other library and thus each library might have
> -to duplicate its input data-set using specific memory allocator. Large project
> -suffer from this and waste resources because of the various memory copy.
> +Split address space also means that libraries can not transparently use data
                                                  cannot
> +they are getting from the core program or another library and thus each library
> +might have to duplicate its input data-set using the device specific memory
                                      data set
> +allocator. Large projects suffer from this and waste resources because of the
> +various memory copies.
>  
>  Duplicating each library API to accept as input or output memory allocted by
                                                                    allocated
>  each device specific allocator is not a viable option. It would lead to a
> -combinatorial explosions in the library entry points.
> +combinatorial explosion in the library entry points.
>  
> -Finaly with the advance of high level language constructs (in C++ but in other
> -language too) it is now possible for compiler to leverage GPU or other devices
> -without even the programmer knowledge. Some of compiler identified patterns are
> -only do-able with a share address. It is as well more reasonable to use a share
> -address space for all the other patterns.
> +Finally, with the advance of high level language constructs (in C++ but in
> +other languages too) it is now possible for the compiler to leverage GPUs and
> +other devices without programmer knowledge. Some compiler identified patterns
> +are only do-able with a shared address space. It is also more reasonable to use
> +a shared address space for all other patterns.
>  
>  
> -------------------------------------------------------------------------------
>  
> -2) System bus, device memory characteristics
> +2) I/O bus, device memory characteristics
>  
> -System bus cripple share address due to few limitations. Most system bus only
> +I/O buses cripple shared address due to few limitations. Most I/O buses only
                     shared address spaces due to a few limitations.
>  allow basic memory access from device to main memory, even cache coherency is
                                                   memory; even
> -often optional. Access to device memory from CPU is even more limited, most
> -often than not it is not cache coherent.
> +often optional. Access to device memory from CPU is even more limited. More
> +often than not, it is not cache coherent.
>  
> -If we only consider the PCIE bus than device can access main memory (often
> -through an IOMMU) and be cache coherent with the CPUs. However it only allows
> -a limited set of atomic operation from device on main memory. This is worse
> -in the other direction the CPUs can only access a limited range of the device
> +If we only consider the PCIE bus, then a device can access main memory (often
> +through an IOMMU) and be cache coherent with the CPUs. However, it only allows
> +a limited set of atomic operations from device on main memory. This is worse
> +in the other direction, the CPU can only access a limited range of the device
            other direction:
>  memory and can not perform atomic operations on it. Thus device memory can not
               cannot                                                     cannot
> -be consider like regular memory from kernel point of view.
> +be considered the same as regular memory from the kernel point of view.
>  
>  Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
> -and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
> -The final limitation is latency, access to main memory from the device has an
> -order of magnitude higher latency than when the device access its own memory.
> +and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
> +The final limitation is latency. Access to main memory from the device has an
> +order of magnitude higher latency than when the device accesses its own memory.
>  
> -Some platform are developing new system bus or additions/modifications to PCIE
> -to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
> +Some platforms are developing new I/O buses or additions/modifications to PCIE
> +to address some of these limitations (OpenCAPI, CCIX). They mainly allow two
                                                                              two-
>  way cache coherency between CPU and device and allow all atomic operations the
> -architecture supports. Saddly not all platform are following this trends and
> -some major architecture are left without hardware solutions to those problems.
> +architecture supports. Saddly, not all platforms are following this trend and
                           Sadly,
> +some major architectures are left without hardware solutions to these problems.
>  
> -So for share address space to make sense not only we must allow device to
> +So for shared address space to make sense, not only must we allow device to
                                                                     devices to
>  access any memory memory but we must also permit any memory to be migrated to
           any memory but
>  device memory while device is using it (blocking CPU access while it happens).
>  
>  
> -------------------------------------------------------------------------------
>  
> -3) Share address space and migration
> +3) Shared address space and migration
>  
>  HMM intends to provide two main features. First one is to share the address
> -space by duplication the CPU page table into the device page table so same
> -address point to same memory and this for any valid main memory address in
> +space by duplicating the CPU page table in the device page table so the same
> +address points to the same physical memory for any valid main memory address in
>  the process address space.
>  
> -To achieve this, HMM offer a set of helpers to populate the device page table
> +To achieve this, HMM offers a set of helpers to populate the device page table
>  while keeping track of CPU page table updates. Device page table updates are
> -not as easy as CPU page table updates. To update the device page table you must
> -allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
> -commands in it to perform the update (unmap, cache invalidations and flush,
> -...). This can not be done through common code for all device. Hence why HMM
> -provides helpers to factor out everything that can be while leaving the gory
> -details to the device driver.
> -
> -The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
> -allow to allocate a struct page for each page of the device memory. Those page
> -are special because the CPU can not map them. They however allow to migrate
> -main memory to device memory using exhisting migration mechanism and everything
> -looks like if page was swap out to disk from CPU point of view. Using a struct
> -page gives the easiest and cleanest integration with existing mm mechanisms.
> -Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
> -for the device memory and second to perform migration. Policy decision of what
> -and when to migrate things is left to the device driver.
> -
> -Note that any CPU access to a device page trigger a page fault and a migration
> -back to main memory ie when a page backing an given address A is migrated from
> -a main memory page to a device page then any CPU access to address A trigger a
> -page fault and initiate a migration back to main memory.
> -
> -
> -With this two features, HMM not only allow a device to mirror a process address
> -space and keeps both CPU and device page table synchronize, but also allow to
> -leverage device memory by migrating part of data-set that is actively use by a
> -device.
> +not as easy as CPU page table updates. To update the device page table, you must
> +allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
> +specific commands in it to perform the update (unmap, cache invalidations, and
> +flush, ...). This can not be done through common code for all devices. Hence
                     cannot
> +why HMM provides helpers to factor out everything that can be while leaving the
> +hardware specific details to the device driver.
> +
> +The second mechanism HMM provides, is a new kind of ZONE_DEVICE memory that
                           provides is
> +allows allocating a struct page for each page of the device memory. Those pages
> +are special because the CPU can not map them. However, they allow migrating
                                cannot
> +main memory to device memory using existing migration mechanisms and everything
> +looks like a page is swapped out to disk from the CPU point of view. Using a
> +struct page gives the easiest and cleanest integration with existing mm mech-
> +anisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
> +memory for the device memory and second to perform migration. Policy decisions
> +of what and when to migrate things is left to the device driver.
> +
> +Note that any CPU access to a device page triggers a page fault and a migration
> +back to main memory. For example, when a page backing a given CPU address A is
> +migrated from a main memory page to a device page, then any CPU access to
> +address A triggers a page fault and initiates a migration back to main memory.
> +
> +With these two features, HMM not only allows a device to mirror process address
> +space and keeping both CPU and device page table synchronized, but also lever-
> +ages device memory by migrating the part of the data-set that is actively being
                                                    data set
> +used by the device.
>  
>  
> -------------------------------------------------------------------------------
>  
>  4) Address space mirroring implementation and API
>  
> -Address space mirroring main objective is to allow to duplicate range of CPU
> -page table into a device page table and HMM helps keeping both synchronize. A
> +Address space mirroring's main objective is to allow duplication of a range of
> +CPU page table into a device page table; HMM helps keep both synchronized. A
>  device driver that want to mirror a process address space must start with the
                      wants
>  registration of an hmm_mirror struct:
>  
> @@ -155,8 +163,8 @@ device driver that want to mirror a process address space must start with the
>  		struct mm_struct *mm);
>  
>  The locked variant is to be use when the driver is already holding the mmap_sem
                           to be used
> -of the mm in write mode. The mirror struct has a set of callback that are use
> -to propagate CPU page table:
> +of the mm in write mode. The mirror struct has a set of callbacks that are used
> +to propagate CPU page tables:
>  
>  struct hmm_mirror_ops {
>  	/* sync_cpu_device_pagetables() - synchronize page tables
> @@ -181,13 +189,13 @@ of the mm in write mode. The mirror struct has a set of callback that are use
>  			     unsigned long end);
>  };
>  
> -Device driver must perform update to the range following action (turn range
> -read only, or fully unmap, ...). Once driver callback returns the device must
> -be done with the update.
> +The device driver must perform the update action to the range (mark range
> +read only, or fully unmap, ...). The device must be done with the update before
> +the driver callback returns.
>  
>  
> -When device driver wants to populate a range of virtual address it can use
> -either:
> +When the device driver wants to populate a range of virtual addresses, it can
> +use either:
>  	int hmm_vma_get_pfns(struct vm_area_struct *vma,
>  			     struct hmm_range *range,
>  			     unsigned long start,
> @@ -201,17 +209,19 @@ When device driver wants to populate a range of virtual address it can use
>  			  bool write,
>  			  bool block);
>  
> -First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
> -will not trigger a page fault on missing or non present entry. The second one
> -do trigger page fault on missing or read only entry if write parameter is true.
> -Page fault use the generic mm page fault code path just like a CPU page fault.
> +The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
> +entries and will not trigger a page fault on missing or non present entries.
                                                            non-present
> +The second one does trigger a page fault on missing or read only entry if the
                                                          read-only
> +write parameter is true. Page faults use the generic mm page fault code path
> +just like a CPU page fault.
>  
> -Both function copy CPU page table into their pfns array argument. Each entry in
> -that array correspond to an address in the virtual range. HMM provide a set of
> -flags to help driver identify special CPU page table entries.
> +Both functions copy CPU page table entries into their pfns array argument. Each
> +entry in that array corresponds to an address in the virtual range. HMM
> +provides a set of flags to help the driver identify special CPU page table
> +entries.
>  
>  Locking with the update() callback is the most important aspect the driver must
> -respect in order to keep things properly synchronize. The usage pattern is :
> +respect in order to keep things properly synchronized. The usage pattern is:
>  
>  	int driver_populate_range(...)
>  	{
> @@ -233,43 +243,44 @@ Locking with the update() callback is the most important aspect the driver must
>  		return 0;
>  	}
>  
> -The driver->update lock is the same lock that driver takes inside its update()
> -callback. That lock must be call before hmm_vma_range_done() to avoid any race
> -with a concurrent CPU page table update.
> +The driver->update lock is the same lock that the driver takes inside its
> +update() callback. That lock must be held before hmm_vma_range_done() to avoid
> +any race with a concurrent CPU page table update.
>  
> -HMM implements all this on top of the mmu_notifier API because we wanted to a
> -simpler API and also to be able to perform optimization latter own like doing
> -concurrent device update in multi-devices scenario.
> +HMM implements all this on top of the mmu_notifier API because we wanted a
> +simpler API and also to be able to perform optimizations latter on like doing
> +concurrent device updates in multi-devices scenario.
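To make the locking pattern above concrete, here is a minimal sketch (an
illustration only, not part of the patch): everything prefixed dmirror_ and
the mutex are hypothetical driver pieces, the pfns array type is an
assumption, and only hmm_vma_get_pfns(), hmm_vma_range_done(), and the
"hold driver->update before hmm_vma_range_done()" rule come from the text:

	#include <linux/hmm.h>
	#include <linux/mutex.h>

	/* Hypothetical driver state: the HMM mirror plus the "driver->update"
	 * lock that the sync_cpu_device_pagetables() callback also takes. */
	struct dmirror {
		struct hmm_mirror mirror;
		struct mutex update;
	};

	static int dmirror_populate(struct dmirror *d,
				    struct vm_area_struct *vma,
				    unsigned long start, unsigned long end,
				    uint64_t *pfns)
	{
		struct hmm_range range;
		int ret;

	again:
		/* Snapshot CPU page table entries; this variant does not
		 * fault in missing or non-present entries. */
		ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
		if (ret)
			return ret;

		/* Take the same lock the update() callback takes, *before*
		 * hmm_vma_range_done(), so a concurrent CPU page table
		 * update cannot race with us. */
		mutex_lock(&d->update);
		if (!hmm_vma_range_done(vma, &range)) {
			/* Snapshot was invalidated while we worked: retry. */
			mutex_unlock(&d->update);
			goto again;
		}

		/* Use pfns[] to build and schedule the device page table
		 * update command buffer (device specific, not shown). */

		mutex_unlock(&d->update);
		return 0;
	}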
>  
> -HMM also serve as an impedence missmatch between how CPU page table update are
> -done (by CPU write to the page table and TLB flushes) from how device update
> -their own page table. Device update is a multi-step process, first appropriate
> -commands are write to a buffer, then this buffer is schedule for execution on
> -the device. It is only once the device has executed commands in the buffer that
> -the update is done. Creating and scheduling update command buffer can happen
> -concurrently for multiple devices. Waiting for each device to report commands
> -as executed is serialize (there is no point in doing this concurrently).
> +HMM also serves as an impedence mismatch between how CPU page table updates
                          impedance
> +are done (by CPU write to the page table and TLB flushes) and how devices
> +update their own page table. Device updates are a multi-step process. First,
> +appropriate commands are writen to a buffer, then this buffer is scheduled for
                             written
> +execution on the device. It is only once the device has executed commands in
> +the buffer that the update is done. Creating and scheduling the update command
> +buffer can happen concurrently for multiple devices. Waiting for each device to
> +report commands as executed is serialized (there is no point in doing this
> +concurrently).
>  
>  
> -------------------------------------------------------------------------------
>  
>  5) Represent and manage device memory from core kernel point of view
>  
> -Several differents design were try to support device memory. First one use
> -device specific data structure to keep information about migrated memory and
> -HMM hooked itself in various place of mm code to handle any access to address
> -that were back by device memory. It turns out that this ended up replicating
> -most of the fields of struct page and also needed many kernel code path to be
> -updated to understand this new kind of memory.
> +Several different designs were tried to support device memory. First one used
> +a device specific data structure to keep information about migrated memory and
> +HMM hooked itself in various places of mm code to handle any access to
> +addresses that were backed by device memory. It turns out that this ended up
> +replicating most of the fields of struct page and also needed many kernel code
> +paths to be updated to understand this new kind of memory.
>  
> -Thing is most kernel code path never try to access the memory behind a page
> -but only care about struct page contents. Because of this HMM switchted to
> -directly using struct page for device memory which left most kernel code path
> -un-aware of the difference. We only need to make sure that no one ever try to
> -map those page from the CPU side.
> +Most kernel code paths never try to access the memory behind a page
> +but only care about struct page contents. Because of this, HMM switched to
> +directly using struct page for device memory which left most kernel code paths
> +unaware of the difference. We only need to make sure that no one ever tries to
> +map those pages from the CPU side.
>  
> -HMM provide a set of helpers to register and hotplug device memory as a new
> -region needing struct page. This is offer through a very simple API:
> +HMM provides a set of helpers to register and hotplug device memory as a new
> +region needing a struct page. This is offered through a very simple API:
>  
>  	struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
>  					  struct device *device,
> @@ -289,18 +300,19 @@ HMM provide a set of helpers to register and hotplug device memory as a new
>  	};
>  
>  The first callback (free()) happens when the last reference on a device page is
> -drop. This means the device page is now free and no longer use by anyone. The
> -second callback happens whenever CPU try to access a device page which it can
> -not do. This second callback must trigger a migration back to system memory.
> +dropped. This means the device page is now free and no longer used by anyone.
> +The second callback happens whenever the CPU tries to access a device page
> +which it can not do. This second callback must trigger a migration back to
              cannot
> +system memory.
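Purely as an illustration of the flow (not from the patch): a sketch of what
registering device memory might look like. The dmem_* names are made up, the
second callback's name (fault) and both callback signatures are assumed from
the elided listing, as is the size argument; only hmm_devmem_add() and the
free()/fault-on-CPU-access callback roles come from the text above:

	#include <linux/hmm.h>

	/* Hypothetical driver helpers, implemented elsewhere. */
	int dmem_migrate_to_ram(struct hmm_devmem *devmem,
				struct vm_area_struct *vma, unsigned long addr);
	void dmem_free_device_page(struct hmm_devmem *devmem,
				   struct page *page);

	/* Second callback: the CPU touched a device page it cannot map.
	 * This must trigger a migration back to system memory (signature
	 * assumed here). */
	static int dmem_fault(struct hmm_devmem *devmem,
			      struct vm_area_struct *vma,
			      unsigned long addr,
			      const struct page *page,
			      unsigned int flags,
			      pmd_t *pmdp)
	{
		return dmem_migrate_to_ram(devmem, vma, addr);
	}

	/* First callback: the last reference on a device page was dropped,
	 * so the page is free; return it to the driver's allocator. */
	static void dmem_free(struct hmm_devmem *devmem, struct page *page)
	{
		dmem_free_device_page(devmem, page);
	}

	static const struct hmm_devmem_ops dmem_devmem_ops = {
		.free  = dmem_free,
		.fault = dmem_fault,
	};

	/* At device probe time: hotplug the device memory so each of its
	 * pages gets a struct page ("size" assumed to be the third arg). */
	devmem = hmm_devmem_add(&dmem_devmem_ops, device, size);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);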
>  
>  
> -------------------------------------------------------------------------------
>  
> -6) Migrate to and from device memory
> +6) Migration to and from device memory
>  
> -Because CPU can not access device memory, migration must use device DMA engine
> -to perform copy from and to device memory. For this we need a new migration
> -helper:
> +Because the CPU can not access device memory, migration must use the device DMA
                    cannot
> +engine to perform copy from and to device memory. For this we need a new
> +migration helper:
>  
>  	int migrate_vma(const struct migrate_vma_ops *ops,
>  			struct vm_area_struct *vma,
> @@ -311,15 +323,15 @@ to perform copy from and to device memory. For this we need a new migration
>  			unsigned long *dst,
>  			void *private);
>  
> -Unlike other migration function it works on a range of virtual address, there
> -is two reasons for that. First device DMA copy has a high setup overhead cost
> +Unlike other migration functions it works on a range of virtual address, there
> +are two reasons for that. First, device DMA copy has a high setup overhead cost
>  and thus batching multiple pages is needed as otherwise the migration overhead
> -make the whole excersie pointless. The second reason is because driver trigger
> -such migration base on range of address the device is actively accessing.
> +makes the whole exersize pointless. The second reason is because the
                    exercise
> +migration might be for a range of addresses the device is actively accessing.
>  
> -The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
> -control destination memory allocation and copy operation. Second one is there
> -to allow device driver to perform cleanup operation after migration.
> +The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy())
> +controls destination memory allocation and copy operation. Second one is there
> +to allow the device driver to perform cleanup operations after migration.
>  
>  	struct migrate_vma_ops {
>  		void (*alloc_and_copy)(struct vm_area_struct *vma,
> @@ -336,19 +348,19 @@ to allow device driver to perform cleanup operation after migration.
>  				       void *private);
>  	};
>  
> -It is important to stress that this migration helpers allow for hole in the
> +It is important to stress that these migration helpers allow for holes in the
>  virtual address range. Some pages in the range might not be migrated for all
> -the usual reasons (page is pin, page is lock, ...). This helper does not fail
> -but just skip over those pages.
> +the usual reasons (page is pinned, page is locked, ...). This helper does not
> +fail but just skips over those pages.
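Again just a sketch to make the callback split concrete (not from the patch):
the dmem_* names are hypothetical and the callback signatures are filled in
from the elided listing, so treat them as assumptions; only migrate_vma(),
alloc_and_copy(), and finalize_and_map() come from the quoted text:

	#include <linux/migrate.h>

	static void dmem_alloc_and_copy(struct vm_area_struct *vma,
					const unsigned long *src,
					unsigned long *dst,
					unsigned long start,
					unsigned long end,
					void *private)
	{
		/* For each entry in src[]: either allocate a destination
		 * page, DMA the data into it and fill the matching dst[]
		 * entry, or leave dst[] empty to skip that page. */
	}

	static void dmem_finalize_and_map(struct vm_area_struct *vma,
					  const unsigned long *src,
					  const unsigned long *dst,
					  unsigned long start,
					  unsigned long end,
					  void *private)
	{
		/* Cleanup: catch pages that did not migrate (holes, failed
		 * struct page migration) and point the device page table at
		 * the pages that did. */
	}

	static const struct migrate_vma_ops dmem_migrate_ops = {
		.alloc_and_copy   = dmem_alloc_and_copy,
		.finalize_and_map = dmem_finalize_and_map,
	};

	/* src[] and dst[] each hold one entry per page in [start, end). */
	ret = migrate_vma(&dmem_migrate_ops, vma, start, end,
			  src, dst, private);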
>  
> -The alloc_and_copy() might as well decide to not migrate all pages in the
> -range (for reasons under the callback control). For those the callback just
> -have to leave the corresponding dst entry empty.
> +The alloc_and_copy() might decide to not migrate all pages in the
> +range (for reasons under the callback control). For those, the callback just
> +has to leave the corresponding dst entry empty.
>  
> -Finaly the migration of the struct page might fails (for file back page) for
> +Finally, the migration of the struct page might fail (for file backed page) for
>  various reasons (failure to freeze reference, or update page cache, ...). If
> -that happens then the finalize_and_map() can catch any pages that was not
> -migrated. Note those page were still copied to new page and thus we wasted
> +that happens, then the finalize_and_map() can catch any pages that were not
> +migrated. Note those pages were still copied to a new page and thus we wasted
>  bandwidth but this is considered as a rare event and a price that we are
>  willing to pay to keep all the code simpler.
>  
> @@ -358,27 +370,27 @@ willing to pay to keep all the code simpler.
>  7) Memory cgroup (memcg) and rss accounting
>  
>  For now device memory is accounted as any regular page in rss counters (either
> -anonymous if device page is use for anonymous, file if device page is use for
> -file back page or shmem if device page is use for share memory). This is a
> -deliberate choice to keep existing application that might start using device
> -memory without knowing about it to keep runing unimpacted.
> -
> -Drawbacks is that OOM killer might kill an application using a lot of device
> -memory and not a lot of regular system memory and thus not freeing much system
> -memory. We want to gather more real world experience on how application and
> -system react under memory pressure in the presence of device memory before
> +anonymous if device page is used for anonymous, file if device page is used for
> +file backed page or shmem if device page is used for shared memory). This is a
> +deliberate choice to keep existing applications, that might start using device
> +memory without knowing about it, running unimpacted.
> +
> +A Drawback is that the OOM killer might kill an application using a lot of
      drawback
> +device memory and not a lot of regular system memory and thus not freeing much
> +system memory. We want to gather more real world experience on how applications
> +and system react under memory pressure in the presence of device memory before
>  deciding to account device memory differently.
>  
>  
> -Same decision was made for memory cgroup. Device memory page are accounted
> +Same decision was made for memory cgroup. Device memory pages are accounted
>  against same memory cgroup a regular page would be accounted to. This does
>  simplify migration to and from device memory. This also means that migration
>  back from device memory to regular memory can not fail because it would
                                              cannot
>  go above memory cgroup limit. We might revisit this choice latter on once we
> -get more experience in how device memory is use and its impact on memory
> +get more experience in how device memory is used and its impact on memory
>  resource control.
>  
>  
> -Note that device memory can never be pin nor by device driver nor through GUP
> +Note that device memory can never be pinned by device driver nor through GUP
>  and thus such memory is always free upon process exit. Or when last reference
> -is drop in case of share memory or file back memory.
> +is dropped in case of shared memory or file backed memory.

-- 
~Randy