* Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> [241108 08:57]: > Locking around VMAs is complicated and confusing. While we have a number of > disparate comments scattered around the place, we seem to be reaching a > level of complexity that justifies a serious effort at clearly documenting > how locks are expected to be used when it comes to interacting with > mm_struct and vm_area_struct objects. > > This is especially pertinent as regards the efforts to find sensible > abstractions for these fundamental objects in kernel rust code whose > compiler strictly requires some means of expressing these rules (and > through this expression, self-document these requirements as well as > enforce them). > > The document limits scope to mmap and VMA locks and those that are > immediately adjacent and relevant to them - so additionally covers page > table locking as this is so very closely tied to VMA operations (and relies > upon us handling these correctly). > > The document tries to cover some of the nastier and more confusing edge > cases and concerns especially around lock ordering and page table teardown. > > The document is split between generally useful information for users of mm > interfaces, and separately a section intended for mm kernel developers > providing a discussion around internal implementation details. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > --- > > REVIEWERS NOTES: > * As before, for convenience, I've uploaded a render of this document to my > website at https://ljs.io/v2/mm/process_addrs > * You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. > > v2: > * Fixed grammar and silly typos in various places. > * Further sharpening up of prose. > * Updated remark about empty -> populated requiring mmap lock not rmap - > this goes for populating _anything_, as we don't want to race the gap > between zap and freeing of page tables which _assumes_ you can't do this. 
> * Clarified point on installing page table entries with rmap locks only. > * Updated swap_readahead_info and numab state entries to mention other > locks/atomicity as per Kirill. > * Improved description of vma->anon_vma and vma->anon_vma_chain as per > Jann. > * Expanded vma->anon_vma to add more details. > * Various typos/small tweaks via Jann. > * Clarified mremap() higher page table lock requirements as per Jann. > * Clarified that lock_vma_under_rcu() _looks up_ the VMA under RCU as per > Jann. > * Clarified RCU requirement for VMA read lock in VMA lock implementation > detail section as per Jann. > * Removed reference to seqnumber increment on mmap write lock as out of > scope at the moment, and incorrect explanation on this (is intended for > speculation going forward) as per Jann. > * Added filemap.c lock ordering also as per Kirill. > * Made the reference to anon/file-backed interval tree root nodes more > explicit in implementation detail section. > * Added note about `MAP_PRIVATE` being in both anon_vma and i_mmap trees. > * Expanded description of page table folding as per Bagas. > * Added missing details about _traversing_ page tables. > * Added the caveat that we can just go ahead and read higher page table > levels if we are simply _traversing_, but if we are to install page table > locks must be acquired and the read double-checked. > * Corrected the comments about gup-fast - we are simply traversing in > gup-fast, which like other page table traversal logic does not acquire > page table locks, but _also_ does not keep the VMA stable. > * Added more details about PMD/PTE lock acquisition in > __pte_offset_map_lock(). > > v1: > * Removed RFC tag as I think we are iterating towards something workable > and there is interest. > * Cleaned up and sharpened the language, structure and layout. Separated > into top-level details and implementation sections as per Alice. > * Replaced links with rather more readable formatting.
> * Improved valid mmap/VMA lock state table. > * Put VMA locks section into the process addresses document as per SJ and > Mike. > * Made clear as to read/write operations against VMA object rather than > userland memory, as per Mike's suggestion, also that it does not refer to > page tables as per Jann. > * Moved note into main section as per Mike's suggestion. > * Fixed grammar mistake as per Mike. > * Converted list-table to table as per Mike. > * Corrected various typos as per Jann, Suren. > * Updated reference to page fault arches as per Jann. > * Corrected mistaken write lock criteria for vm_lock_seq as per Jann. > * Updated vm_pgoff description to reference CONFIG_ARCH_HAS_PTE_SPECIAL as > per Jann. > * Updated write lock to mmap read for vma->numab_state as per Jann. > * Clarified that the write lock is on the mmap and VMA lock at VMA > granularity earlier in description as per Suren. > * Added explicit note at top of VMA lock section to explicitly highlight > VMA lock semantics as per Suren. > * Updated required locking for vma lock fields to N/A to avoid confusion as > per Suren. > * Corrected description of mmap_downgrade() as per Suren. > * Added a note on gate VMAs as per Jann. > * Explained that taking mmap read lock under VMA lock is a bad idea due to > deadlock as per Jann. > * Discussed atomicity in page table operations as per Jann. > * Adapted the well thought out page table locking rules as provided by Jann. > * Added a comment about pte mapping maintaining an RCU read lock. > * Added clarification on moving page tables as informed by Jann's comments > (though it turns out mremap() doesn't necessarily hold all locks if it > can resolve races other ways :) > * Added Jann's diagram showing lock exclusivity characteristics. 
> https://lore.kernel.org/all/20241107190137.58000-1-lorenzo.stoakes@xxxxxxxxxx/ > > RFC: > https://lore.kernel.org/all/20241101185033.131880-1-lorenzo.stoakes@xxxxxxxxxx/ > > Documentation/mm/process_addrs.rst | 813 +++++++++++++++++++++++++++++ > 1 file changed, 813 insertions(+) > > diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst > index e8618fbc62c9..5aef4fd0e0e9 100644 > --- a/Documentation/mm/process_addrs.rst > +++ b/Documentation/mm/process_addrs.rst > @@ -3,3 +3,816 @@ > ================= > Process Addresses > ================= > + > +.. toctree:: > + :maxdepth: 3 > + > + > +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or > +'VMA's of type :c:struct:`!struct vm_area_struct`. > + > +Each VMA describes a virtually contiguous memory range with identical > +attributes, each described by a :c:struct:`!struct vm_area_struct` > +object. Userland access outside of VMAs is invalid except in the case where an > +adjacent stack VMA could be extended to contain the accessed address. > + > +All VMAs are contained within one and only one virtual address space, described > +by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is, > +threads) which share the virtual address space. We refer to this as the > +:c:struct:`!mm`. > + > +Each mm object contains a maple tree data structure which describes all VMAs > +within the virtual address space. > + > +.. note:: An exception to this is the 'gate' VMA which is provided by > + architectures which use :c:struct:`!vsyscall` and is a global static > + object which does not belong to any specific mm.

vvars too?

> + > +------- > +Locking > +------- > + > +The kernel is designed to be highly scalable against concurrent read operations > +on VMA **metadata** so a complicated set of locks is required to ensure memory > +corruption does not occur. > + > +.. 
note:: Locking VMAs for their metadata does not have any impact on the memory > + they describe nor the page tables that map them. > + > +Terminology > +----------- > + > +* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock` > + which locks at a process address space granularity which can be acquired via > + :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants. > +* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves > + as a read/write semaphore in practice. A VMA read lock is obtained via > + :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a > + write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked > + automatically when the mmap write lock is released). To take a VMA write lock > + you **must** have already acquired an :c:func:`!mmap_write_lock`. > +* **rmap locks** - When trying to access VMAs through the reverse mapping via a > + :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object > + (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via > + :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for > + anonymous memory and :c:func:`!i_mmap_[try]lock_read` or > + :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these > + locks as the reverse mapping locks, or 'rmap locks' for brevity. > + > +We discuss page table locks separately in the dedicated section below. > + > +The first thing **any** of these locks achieve is to **stabilise** the VMA > +within the MM tree. That is, guaranteeing that the VMA object will not be > +deleted from under you nor modified (except for some specific fields > +described below). > + > +Stabilising a VMA also keeps the address space described by it around. 
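
A small userspace model may help make the write-side rule in the terminology list concrete: a VMA write lock can only be taken while the mmap write lock is held, and every VMA write lock goes away when the mmap write lock is released. This is only a sketch. All `toy_*` names are hypothetical, the model is single-threaded, and the epoch counter is just one way to get that behaviour here; it is not presented as what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for mm_struct and vm_area_struct. */
struct toy_mm {
	bool write_locked;	/* models the mmap write lock being held */
	unsigned long epoch;	/* bumped each time the write lock is dropped */
};

struct toy_vma {
	struct toy_mm *mm;
	unsigned long write_epoch;	/* epoch at which the VMA was write-locked */
	bool ever_locked;		/* false until the first successful lock */
};

void toy_mmap_write_lock(struct toy_mm *mm)
{
	assert(!mm->write_locked);	/* single-threaded model: no waiting */
	mm->write_locked = true;
}

/* Dropping the mmap write lock implicitly drops every VMA write lock. */
void toy_mmap_write_unlock(struct toy_mm *mm)
{
	assert(mm->write_locked);
	mm->epoch++;
	mm->write_locked = false;
}

/* Fails unless the caller already holds the mmap write lock. */
bool toy_vma_start_write(struct toy_vma *vma)
{
	if (!vma->mm->write_locked)
		return false;
	vma->write_epoch = vma->mm->epoch;
	vma->ever_locked = true;
	return true;
}

bool toy_vma_is_write_locked(const struct toy_vma *vma)
{
	return vma->ever_locked && vma->mm->write_locked &&
	       vma->write_epoch == vma->mm->epoch;
}
```

Note there is deliberately no per-VMA unlock function in the sketch, mirroring the statement that VMA write locks are released automatically with the mmap write lock.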
> + > +Using address space locks > +------------------------- > + > +If you want to **read** VMA metadata fields or just keep the VMA stable, you > +must do one of the following: > + > +* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a > + suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when > + you're done with the VMA, *or* > +* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to > + acquire the lock atomically so might fail, in which case fall-back logic is > + required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`, > + *or* > +* Acquire an rmap lock before traversing the locked interval tree (whether > + anonymous or file-backed) to obtain the required VMA. > + > +If you want to **write** VMA metadata fields, then things vary depending on the > +field (we explore each VMA field in detail below). For the majority you must: > + > +* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a > + suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when > + you're done with the VMA, *and* > +* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to > + modify, which will be released automatically when :c:func:`!mmap_write_unlock` is > + called. > +* If you want to be able to write to **any** field, you must also hide the VMA > + from the reverse mapping by obtaining an **rmap write lock**. > + > +VMA locks are special in that you must obtain an mmap **write** lock **first** > +in order to obtain a VMA **write** lock. A VMA **read** lock however can be > +obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then > +release an RCU lock to look up the VMA for you).

This reduces the impact of a writer on readers by only impacting conflicting areas of the vma tree.

> + > +.. 
note:: The primary users of VMA read locks are page fault handlers, which > + means that without a VMA write lock, page faults will run concurrently with > + whatever you are doing.

This is the primary user in that it's the most frequent, but as we unwind other lock messes it is becoming a pattern. Maybe "the most frequent users"?

> + > +Examining all valid lock states: > + > +.. table:: > + > + ========= ======== ========= ======= ===== =========== ========== > + mmap lock VMA lock rmap lock Stable? Read? Write most? Write all? > + ========= ======== ========= ======= ===== =========== ========== > + \- \- \- N N N N > + \- R \- Y Y N N > + \- \- R/W Y Y N N > + R/W \-/R \-/R/W Y Y N N > + W W \-/R Y Y Y N > + W W W Y Y Y Y > + ========= ======== ========= ======= ===== =========== ========== > + > +.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock, > + attempting to do the reverse is invalid as it can result in deadlock - if > + another task already holds an mmap write lock and attempts to acquire a VMA > + write lock that will deadlock on the VMA read lock. > + > +All of these locks behave as read/write semaphores in practice, so you can > +obtain either a read or a write lock for each of these. > + > +.. note:: Generally speaking, a read/write semaphore is a class of lock which > + permits concurrent readers. However a write lock can only be obtained > + once all readers have left the critical region (and pending readers > + made to wait). > + > + This renders read locks on a read/write semaphore concurrent with other > + readers and write locks exclusive against all others holding the semaphore. > + > +VMA fields > +^^^^^^^^^^ > + > +We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it > +easier to explore their locking characteristics: > + > +.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these > + are in effect an internal implementation detail. > + > +.. 
table:: Virtual layout fields > + > + ===================== ======================================== =========== > + Field Description Write lock > + ===================== ======================================== =========== > + :c:member:`!vm_start` Inclusive start virtual address of range mmap write, > + VMA describes. VMA write, > + rmap write. > + :c:member:`!vm_end` Exclusive end virtual address of range mmap write, > + VMA describes. VMA write, > + rmap write. > + :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write, > + the original page offset within the VMA write, > + virtual address space (prior to any rmap write. > + :c:func:`!mremap`), or PFN if a PFN map > + and the architecture does not support > + :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`. > + ===================== ======================================== =========== > + > +These fields describe the size, start and end of the VMA, and as such cannot be > +modified without first being hidden from the reverse mapping since these fields > +are used to locate VMAs within the reverse mapping interval trees. > + > +.. table:: Core fields > + > + ============================ ======================================== ========================= > + Field Description Write lock > + ============================ ======================================== ========================= > + :c:member:`!vm_mm` Containing mm_struct. None - written once on > + initial map. > + :c:member:`!vm_page_prot` Architecture-specific page table mmap write, VMA write. > + protection bits determined from VMA > + flags. > + :c:member:`!vm_flags` Read-only access to VMA flags describing N/A > + attributes of the VMA, in union with > + private writable > + :c:member:`!__vm_flags`. > + :c:member:`!__vm_flags` Private, writable access to VMA flags mmap write, VMA write. > + field, updated by > + :c:func:`!vm_flags_*` functions. 
> + :c:member:`!vm_file` If the VMA is file-backed, points to a None - written once on > + struct file object describing the initial map. > + underlying file, if anonymous then > + :c:macro:`!NULL`. > + :c:member:`!vm_ops` If the VMA is file-backed, then either None - Written once on > + the driver or file-system provides a initial map by > + :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`. > + object describing callbacks to be > + invoked on VMA lifetime events. > + :c:member:`!vm_private_data` A :c:member:`!void *` field for Handled by driver. > + driver-specific metadata. > + ============================ ======================================== ========================= > + > +These are the core fields which describe the MM the VMA belongs to and its attributes. > + > +.. table:: Config-specific fields > + > + ================================= ===================== ======================================== =============== > + Field Configuration option Description Write lock > + ================================= ===================== ======================================== =============== > + :c:member:`!anon_name` CONFIG_ANON_VMA_NAME A field for storing a mmap write, > + :c:struct:`!struct anon_vma_name` VMA write. > + object providing a name for anonymous > + mappings, or :c:macro:`!NULL` if none > + is set or the VMA is file-backed.

These are ref counted and can be shared by more than one vma for scalability.

> + :c:member:`!swap_readahead_info` CONFIG_SWAP Metadata used by the swap mechanism mmap read, > + to perform readahead. This field is swap-specific > + accessed atomically. lock. > + :c:member:`!vm_policy` CONFIG_NUMA :c:type:`!mempolicy` object which mmap write, > + describes the NUMA behaviour of the VMA write. > + VMA.

These are also ref counted for scalability.

> + :c:member:`!numab_state` CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which mmap read, > + describes the current state of numab-specific > + NUMA balancing in relation to this VMA. lock. > + Updated under mmap read lock by > + :c:func:`!task_numa_work`. > + :c:member:`!vm_userfaultfd_ctx` CONFIG_USERFAULTFD Userfaultfd context wrapper object of mmap write, > + type :c:type:`!vm_userfaultfd_ctx`, VMA write. > + either of zero size if userfaultfd is > + disabled, or containing a pointer > + to an underlying > + :c:type:`!userfaultfd_ctx` object which > + describes userfaultfd metadata. > + ================================= ===================== ======================================== =============== > + > +These fields are present or not depending on whether the relevant kernel > +configuration option is set. > + > +.. table:: Reverse mapping fields > + > + =================================== ========================================= ============================ > + Field Description Write lock > + =================================== ========================================= ============================ > + :c:member:`!shared.rb` A red/black tree node used, if the mmap write, VMA write, > + mapping is file-backed, to place the VMA i_mmap write. > + in the > + :c:member:`!struct address_space->i_mmap` > + red/black interval tree. > + :c:member:`!shared.rb_subtree_last` Metadata used for management of the mmap write, VMA write, > + interval tree if the VMA is file-backed. i_mmap write. > + :c:member:`!anon_vma_chain` List of pointers to both forked/CoW’d mmap read, anon_vma write. > + :c:type:`!anon_vma` objects and > + :c:member:`!vma->anon_vma` if it is > + non-:c:macro:`!NULL`. > + :c:member:`!anon_vma` :c:type:`!anon_vma` object used by When :c:macro:`NULL` and > + anonymous folios mapped exclusively to setting non-:c:macro:`NULL`: > + this VMA. Initially set by mmap read, page_table_lock. 
> + :c:func:`!anon_vma_prepare` serialised > + by the :c:macro:`!page_table_lock`. This When non-:c:macro:`NULL` and > + is set as soon as any page is faulted in. setting :c:macro:`NULL`: > + mmap write, VMA write, > + anon_vma write. > + =================================== ========================================= ============================ > + > +These fields are used to both place the VMA within the reverse mapping, and for > +anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects > +and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should > +reside. > + > +.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set > + then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap` > + trees at the same time, so all of these fields might be utilised at > + once. > + > +Page tables > +----------- > + > +We won't speak exhaustively on the subject but broadly speaking, page tables map > +virtual addresses to physical ones through a series of page tables, each of > +which contain entries with physical addresses for the next page table level > +(along with flags), and at the leaf level the physical addresses of the > +underlying physical data pages or a special entry such as a swap entry, > +migration entry or other special marker. Offsets into these pages are provided > +by the virtual address itself. > + > +In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge > +pages might eliminate one or two of these levels, but when this is the case we > +typically refer to the leaf level as the PTE level regardless. > + > +.. note:: In instances where the architecture supports fewer page tables than > + five the kernel cleverly 'folds' page table levels, that is stubbing > + out functions related to the skipped levels. 
This allows us to > + conceptually act as if there were always five levels, even if the > + compiler might, in practice, eliminate any code relating to missing > + ones. > + > +There are four key operations typically performed on page tables: > + > +1. **Traversing** page tables - Simply reading page tables in order to traverse > + them. This only requires that the VMA is kept stable, so a lock which > + establishes this suffices for traversal (there are also lockless variants > + which eliminate even this requirement, such as :c:func:`!gup_fast`). > +2. **Installing** page table mappings - Whether creating a new mapping or > + modifying an existing one. This requires that the VMA is kept stable via an > + mmap or VMA lock (explicitly not rmap locks). > +3. **Zapping/unmapping** page table entries - This is what the kernel calls > + clearing page table mappings at the leaf level only, whilst leaving all page > + tables in place. This is a very common operation in the kernel performed on > + file truncation, the :c:macro:`!MADV_DONTNEED` operation via > + :c:func:`!madvise`, and others. This is performed by a number of functions > + including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages` > + among others. The VMA need only be kept stable for this operation. > +4. **Freeing** page tables - When finally the kernel removes page tables from a > + userland process (typically via :c:func:`!free_pgtables`) extreme care must > + be taken to ensure this is done safely, as this logic finally frees all page > + tables in the specified range, ignoring existing leaf entries (it assumes the > + caller has both zapped the range and prevented any further faults or > + modifications within it). > + > +**Traversing** and **zapping** ranges can be performed holding any one of the > +locks described in the terminology section above - that is the mmap lock, the > +VMA lock or either of the reverse mapping locks. 
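
The rules for the first three operations in the list above can be written down as simple predicates over the locks a caller holds. This is a hedged sketch only: the `toy_*` types and functions are hypothetical, lockless walkers such as GUP-fast and the special requirements for freeing are not modelled, and page table locks are ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_mode { TOY_NONE, TOY_READ, TOY_WRITE };

/* Which of the three lock classes the caller currently holds. */
struct toy_held {
	enum toy_mode mmap;
	enum toy_mode vma;
	enum toy_mode rmap;
};

/* Holding any one of the locks keeps the VMA stable. */
bool toy_vma_stable(struct toy_held h)
{
	return h.mmap != TOY_NONE || h.vma != TOY_NONE || h.rmap != TOY_NONE;
}

/* Traversing and zapping only need a stable VMA. */
bool toy_may_traverse_or_zap(struct toy_held h)
{
	return toy_vma_stable(h);
}

/* Installing needs the mmap or VMA lock; an rmap lock alone is not enough. */
bool toy_may_install(struct toy_held h)
{
	return h.mmap != TOY_NONE || h.vma != TOY_NONE;
}
```

For instance, a caller holding only an rmap read lock satisfies `toy_may_traverse_or_zap()` but not `toy_may_install()`.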
> + > +That is - as long as you keep the relevant VMA **stable** - you are good to go > +ahead and perform these operations on page tables (though internally, kernel > +operations that perform writes also acquire internal page table locks to > +serialise - see the page table implementation detail section for more details). > + > +When **installing** page table entries, the mmap or VMA lock must be held to keep > +the VMA stable. We explore why this is in the page table locking details section > +below. > + > +**Freeing** page tables is an entirely internal memory management operation and > +has special requirements (see the page freeing section below for more details). > + > +.. warning:: When **freeing** page tables, it must not be possible for VMAs > + containing the ranges those page tables map to be accessible via > + the reverse mapping. > + > + The :c:func:`!free_pgtables` function removes the relevant VMAs > + from the reverse mappings, but no other VMAs can be permitted to be > + accessible and span the specified range. > + > +Lock ordering > +------------- > + > +As we have multiple locks across the kernel which may or may not be taken at the > +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and > +the **order** in which locks are acquired and released becomes very important. > + > +.. note:: Lock inversion occurs when two threads need to acquire multiple locks, > + but in doing so inadvertently cause a mutual deadlock. > + > + For example, consider thread 1 which holds lock A and tries to acquire lock B, > + while thread 2 holds lock B and tries to acquire lock A. > + > + Both threads are now deadlocked on each other. However, had they attempted to > + acquire locks in the same order, one would have waited for the other to > + complete its work and no deadlock would have occurred. > + > +The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required > +ordering of locks within memory management code: > + > +.. 
code-block:: > + > + inode->i_rwsem (while writing or truncating, not reading or faulting) > + mm->mmap_lock > + mapping->invalidate_lock (in filemap_fault) > + folio_lock > + hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below) > + vma_start_write > + mapping->i_mmap_rwsem > + anon_vma->rwsem > + mm->page_table_lock or pte_lock > + swap_lock (in swap_duplicate, swap_info_get) > + mmlist_lock (in mmput, drain_mmlist and others) > + mapping->private_lock (in block_dirty_folio) > + i_pages lock (widely used) > + lruvec->lru_lock (in folio_lruvec_lock_irq) > + inode->i_lock (in set_page_dirty's __mark_inode_dirty) > + bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) > + sb_lock (within inode_lock in fs/fs-writeback.c) > + i_pages lock (widely used, in set_page_dirty, > + in arch-dependent flush_dcache_mmap_lock, > + within bdi.wb->list_lock in __sync_single_inode) > + > +There is also a file-system specific lock ordering comment located at the top of > +:c:macro:`!mm/filemap.c`: > + > +.. 
code-block:: > + > + ->i_mmap_rwsem (truncate_pagecache) > + ->private_lock (__free_pte->block_dirty_folio) > + ->swap_lock (exclusive_swap_page, others) > + ->i_pages lock > + > + ->i_rwsem > + ->invalidate_lock (acquired by fs in truncate path) > + ->i_mmap_rwsem (truncate->unmap_mapping_range) > + > + ->mmap_lock > + ->i_mmap_rwsem > + ->page_table_lock or pte_lock (various, mainly in memory.c) > + ->i_pages lock (arch-dependent flush_dcache_mmap_lock) > + > + ->mmap_lock > + ->invalidate_lock (filemap_fault) > + ->lock_page (filemap_fault, access_process_vm) > + > + ->i_rwsem (generic_perform_write) > + ->mmap_lock (fault_in_readable->do_page_fault) > + > + bdi->wb.list_lock > + sb_lock (fs/fs-writeback.c) > + ->i_pages lock (__sync_single_inode) > + > + ->i_mmap_rwsem > + ->anon_vma.lock (vma_merge) > + > + ->anon_vma.lock > + ->page_table_lock or pte_lock (anon_vma_prepare and various) > + > + ->page_table_lock or pte_lock > + ->swap_lock (try_to_unmap_one) > + ->private_lock (try_to_unmap_one) > + ->i_pages lock (try_to_unmap_one) > + ->lruvec->lru_lock (follow_page_mask->mark_page_accessed) > + ->lruvec->lru_lock (check_pte_range->folio_isolate_lru) > + ->private_lock (folio_remove_rmap_pte->set_page_dirty) > + ->i_pages lock (folio_remove_rmap_pte->set_page_dirty) > + bdi.wb->list_lock (folio_remove_rmap_pte->set_page_dirty) > + ->inode->i_lock (folio_remove_rmap_pte->set_page_dirty) > + bdi.wb->list_lock (zap_pte_range->set_page_dirty) > + ->inode->i_lock (zap_pte_range->set_page_dirty) > + ->private_lock (zap_pte_range->block_dirty_folio) > + > +Please check the current state of these comments which may have changed since > +the time of writing of this document.

hugetlbfs has its own locking and is out of scope.
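
The lock inversion note quoted earlier in this section can be illustrated with a userspace toy: if every caller takes a pair of locks in one agreed order, the "thread 1 holds A and wants B, thread 2 holds B and wants A" situation cannot arise. This is only a sketch. `struct toy_lock`, its `rank` field and the helper names are hypothetical, and the spinlock is built on C11 `atomic_flag` rather than any kernel primitive.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical toy spinlock with a fixed position in the lock hierarchy. */
struct toy_lock {
	atomic_flag held;
	int rank;	/* the lower rank is always taken first */
};

static void toy_acquire(struct toy_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;	/* spin until the current holder releases */
}

static void toy_release(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Take both locks in rank order, whatever order the caller names them in.
 * Returns the lock that was taken first.
 */
struct toy_lock *toy_lock_both(struct toy_lock *x, struct toy_lock *y)
{
	struct toy_lock *first = x->rank < y->rank ? x : y;
	struct toy_lock *second = first == x ? y : x;

	assert(x != y && x->rank != y->rank);
	toy_acquire(first);
	toy_acquire(second);
	return first;
}

/* Release in the reverse of the acquisition order. */
void toy_unlock_both(struct toy_lock *x, struct toy_lock *y)
{
	struct toy_lock *first = x->rank < y->rank ? x : y;
	struct toy_lock *second = first == x ? y : x;

	toy_release(second);
	toy_release(first);
}
```

Calling `toy_lock_both(&a, &b)` and `toy_lock_both(&b, &a)` acquires the locks in the same order, which is the property the ordering comments above are there to preserve.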
> + > +------------------------------ > +Locking Implementation Details > +------------------------------ > + > +Page table locking details > +-------------------------- > + > +In addition to the locks described in the terminology section above, we have > +additional locks dedicated to page tables: > + > +* **Higher level page table locks** - Higher level page tables, that is PGD, P4D > + and PUD each make use of the process address space granularity > + :c:member:`!mm->page_table_lock` lock when modified. > + > +* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks > + either kept within the folios describing the page tables or allocated > + separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is > + set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are > + mapped into high memory (if a 32-bit system) and carefully locked via > + :c:func:`!pte_offset_map_lock`. > + > +These locks represent the minimum required to interact with each page table > +level, but there are further requirements. > + > +Importantly, note that on a **traversal** of page tables, no such locks are > +taken. Whether care is taken on reading the page table entries depends on the > +architecture, see the section on atomicity below. > + > +Locking rules > +^^^^^^^^^^^^^ > + > +We establish basic locking rules when interacting with page tables: > + > +* When changing a page table entry the page table lock for that page table > + **must** be held, except if you can safely assume nobody can access the page > + tables concurrently (such as on invocation of :c:func:`!free_pgtables`). > +* Reads from and writes to page table entries must be *appropriately* > + atomic. See the section on atomicity below for details. > +* Populating previously empty entries requires that the mmap or VMA locks are > + held (read or write), doing so with only rmap locks would be dangerous (see > + the warning below).

Which is the rmap lock? It's not listed as rmap lock in the rmap file.

> +* As mentioned previously, zapping can be performed while simply keeping the VMA > + stable, that is holding any one of the mmap, VMA or rmap locks. > +* Special care is required for PTEs, as on 32-bit architectures these must be > + mapped into high memory and additionally, careful consideration must be > + applied to racing with THP, migration or other concurrent kernel operations > + that might steal the entire PTE table from under us. All this is handled by > + :c:func:`!pte_offset_map_lock` (see the section on page table installation > + below for more details). > + > +.. warning:: Populating previously empty entries is dangerous as, when unmapping > + VMAs, :c:func:`!vms_clear_ptes` has a window of time between > + zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via > + :c:func:`!free_pgtables`), where the VMA is still visible in the > + rmap tree. :c:func:`!free_pgtables` assumes that the zap has > + already been performed and removes PTEs unconditionally (along with > + all other page tables in the freed range), so installing new PTE > + entries could leak memory and also cause other unexpected and > + dangerous behaviour. > + > +There are additional rules applicable when moving page tables, which we discuss > +in the section on this topic below. > + > +.. note:: Interestingly, :c:func:`!pte_offset_map_lock` holds an RCU read lock > + while the PTE page table lock is held. > + > +Atomicity > +^^^^^^^^^ > + > +Regardless of page table locks, the MMU hardware concurrently updates accessed > +and dirty bits (perhaps more, depending on architecture). Additionally, page > +table traversal operations run in parallel (though holding the VMA stable) and > +functionality like GUP-fast locklessly traverses (that is, reads) page tables > +without even keeping the VMA stable at all. 
> + > +When performing a page table traversal and keeping the VMA stable, whether a > +read must be performed once and only once or not depends on the architecture > +(for instance x86-64 does not require any special precautions). > + > +It is on the write side, or if a read informs whether a write takes place (on an > +installation of a page table entry say, for instance in > +:c:func:`!__pud_install`), where special care must always be taken. In these > +cases we can never assume that page table locks give us entirely exclusive > +access, and must retrieve page table entries once and only once. > + > +If we are reading page table entries, then we need only ensure that the compiler > +does not rearrange our loads. This is achieved via :c:func:`!pXXp_get` > +functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`, > +:c:func:`!pmdp_get`, and :c:func:`!ptep_get`. > + > +Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads > +the page table entry only once. > + > +However, if we wish to manipulate an existing page table entry and care about > +the previously stored data, we must go further and use a hardware atomic > +operation as, for example, in :c:func:`!ptep_get_and_clear`. > + > +Equally, operations that do not rely on the VMA being held stable, such as > +GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like > +:c:func:`!gup_fast_pte_range`), must very carefully interact with page table > +entries, using functions such as :c:func:`!ptep_get_lockless` and equivalents for > +higher page table levels. > + > +Writes to page table entries must also be appropriately atomic, as established > +by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`, > +:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`. 
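
The distinction drawn above between "read the entry exactly once" and "read and clear in one hardware-atomic step" can be sketched in userspace with C11 atomics. This is an analogy only: `toy_pte_t` and both helpers are hypothetical names, a relaxed atomic load stands in for `READ_ONCE()`, and an atomic exchange stands in for whatever a given architecture does in `ptep_get_and_clear()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a page table entry slot. */
typedef _Atomic uint64_t toy_pte_t;

/* Read the entry exactly once, in the spirit of the pXXp_get() helpers. */
uint64_t toy_pte_get(toy_pte_t *ptep)
{
	return atomic_load_explicit(ptep, memory_order_relaxed);
}

/*
 * Clear the entry and return what it held in one indivisible step, so an
 * update that lands between a separate "read" and "clear" cannot be lost.
 */
uint64_t toy_pte_get_and_clear(toy_pte_t *ptep)
{
	return atomic_exchange(ptep, 0);
}
```

With a plain load followed by a plain store of zero, a bit set concurrently (the text mentions accessed and dirty bits updated by the MMU) could be overwritten without ever being observed; the exchange returns whatever value was actually replaced.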
> +
> +Equally, functions which clear page table entries must be appropriately atomic,
> +as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
> +:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
> +:c:func:`!pte_clear`.
> +
> +Page table installation
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Page table installation is performed with the VMA held stable explicitly by an
> +mmap or VMA lock in read or write mode (see the warning in the locking rules
> +section for details as to why).
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
> +acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
> +:c:func:`!__pmd_alloc` respectively.
> +
> +.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
> +   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
> +   references the :c:member:`!mm->page_table_lock`.
> +
> +Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
> +:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
> +physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
> +:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
> +:c:func:`!__pte_alloc`.
> +
> +Finally, modifying the contents of the PTE requires special treatment, as the
> +PTE page table lock must be acquired whenever we want stable and exclusive
> +access to entries contained within a PTE, especially when we wish to modify
> +them.
> +
> +This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
> +ensure that the PTE hasn't changed from under us, ultimately invoking
> +:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
> +the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
> +must be released via :c:func:`!pte_unmap_unlock`.
> +
> +.. note:: There are some variants on this, such as
> +   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
> +   for brevity we do not explore this. See the comment for
> +   :c:func:`!__pte_offset_map_lock` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).
> +
> +A typical pattern taken when traversing page table entries to install a new
> +mapping is to optimistically determine whether the page table entry in the table
> +above is empty, if so, only then acquiring the page table lock and checking
> +again to see if it was allocated underneath us.
> +
> +This allows for a traversal with page table locks only being taken when
> +required. An example of this is :c:func:`!__pud_alloc`.
> +
> +At the leaf page table, that is the PTE, we can't entirely rely on this pattern
> +as we have separate PMD and PTE locks and a THP collapse for instance might have
> +eliminated the PMD entry as well as the PTE from under us.
> +
> +This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
> +for the PTE, carefully checking it is as expected, before acquiring the
> +PTE-specific lock, and then *again* checking that the PMD entry is as expected.
> +
> +If a THP collapse (or similar) were to occur then the lock on both pages would
> +be acquired, so we can ensure this is prevented while the PTE lock is held.
> +
> +Installing entries this way ensures mutual exclusion on write.
> +

I stopped here, but missed the v1 comment time so I'm sending this now.

...
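
For reference, the optimistic check, lock, then re-check pattern described in the quoted installation section can be sketched in user space roughly as follows. This is an illustration only, assuming a pthread mutex as a stand-in for mm->page_table_lock and a C11 atomic pointer as a stand-in for the entry in the table above; the demo_* names are hypothetical and none of this is the kernel's implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical lower-level table that gets installed on demand. */
struct demo_table {
    long entries[8];
};

/* Hypothetical analogue of the relevant parts of an mm_struct. */
struct demo_mm {
    pthread_mutex_t page_table_lock;     /* stand-in for mm->page_table_lock */
    _Atomic(struct demo_table *) upper;  /* entry in the table above */
};

/* Returns the installed table, allocating one only if the entry is empty. */
struct demo_table *demo_table_alloc(struct demo_mm *mm)
{
    struct demo_table *cur, *new;

    /* Optimistic, lockless check: most callers find the entry populated. */
    cur = atomic_load_explicit(&mm->upper, memory_order_acquire);
    if (cur)
        return cur;

    new = calloc(1, sizeof(*new));
    if (!new)
        return NULL;

    pthread_mutex_lock(&mm->page_table_lock);
    /* Re-check under the lock: somebody may have installed a table meanwhile. */
    cur = atomic_load_explicit(&mm->upper, memory_order_relaxed);
    if (!cur) {
        atomic_store_explicit(&mm->upper, new, memory_order_release);
        cur = new;
        new = NULL;
    }
    pthread_mutex_unlock(&mm->page_table_lock);

    free(new);  /* only non-NULL if we lost the race; free(NULL) is a no-op */
    return cur;
}

/* Two installs on the same entry must yield the same table. */
int demo_install_is_idempotent(void)
{
    struct demo_mm mm = { .page_table_lock = PTHREAD_MUTEX_INITIALIZER };
    struct demo_table *a = demo_table_alloc(&mm);
    struct demo_table *b = demo_table_alloc(&mm);
    int ok = a != NULL && a == b;

    free(a);
    return ok;
}
```

The lock is only taken on the slow path, which is the point of the pattern. As the quoted text notes, this alone is not sufficient at the PTE level, where the PMD entry has to be checked again after the PTE lock is acquired.
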