The patch titled Subject: docs-mm-add-vma-locks-documentation-v3 has been added to the -mm mm-unstable branch. Its filename is docs-mm-add-vma-locks-documentation-v3.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/docs-mm-add-vma-locks-documentation-v3.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> Subject: docs-mm-add-vma-locks-documentation-v3 Date: Thu, 14 Nov 2024 20:54:01 +0000 Link: https://lkml.kernel.org/r/20241114205402.859737-1-lorenzo.stoakes@xxxxxxxxxx Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> Acked-by: Mike Rapoport (Microsoft) <rppt@xxxxxxxxxx> Acked-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> (for page table locks part) Reviewed-by: Bagas Sanjaya <bagasdotme@xxxxxxxxx> Reviewed-by: Jann Horn <jannh@xxxxxxxxxx> Cc: Alice Ryhl <aliceryhl@xxxxxxxxxx> Cc: Boqun Feng <boqun.feng@xxxxxxxxx> Cc: Hillf Danton <hdanton@xxxxxxxx> Cc: Jonathan Corbet <corbet@xxxxxxx> Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx> Cc: SeongJae Park <sj@xxxxxxxxxx> Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx> Cc: Vlastimil Babka <vbabka@xxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- Documentation/mm/process_addrs.rst | 99 +++++++++++++++------------ 1 file changed, 55 insertions(+), 44 deletions(-) --- a/Documentation/mm/process_addrs.rst~docs-mm-add-vma-locks-documentation-v3 +++ a/Documentation/mm/process_addrs.rst @@ -53,7 +53,7 @@ Terminology you **must** have already acquired an :c:func:`!mmap_write_lock`. * **rmap locks** - When trying to access VMAs through the reverse mapping via a :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object - (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via + (reachable from a folio via :c:member:`!folio->mapping`). VMAs must be stabilised via :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for anonymous memory and :c:func:`!i_mmap_[try]lock_read` or :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these @@ -101,6 +101,9 @@ in order to obtain a VMA **write** lock. obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then release an RCU lock to lookup the VMA for you). +This constrains the impact of writers on readers, as a writer can interact with +one VMA while a reader interacts with another simultaneously. + .. note:: The primary users of VMA read locks are page fault handlers, which means that without a VMA write lock, page faults will run concurrent with whatever you are doing. @@ -209,13 +212,17 @@ These are the core fields which describe :c:struct:`!struct anon_vma_name` VMA write. object providing a name for anonymous mappings, or :c:macro:`!NULL` if none - is set or the VMA is file-backed. 
+ is set or the VMA is file-backed. The + underlying object is reference counted + and can be shared across multiple VMAs + for scalability. :c:member:`!swap_readahead_info` CONFIG_SWAP Metadata used by the swap mechanism mmap read, to perform readahead. This field is swap-specific accessed atomically. lock. :c:member:`!vm_policy` CONFIG_NUMA :c:type:`!mempolicy` object which mmap write, describes the NUMA behaviour of the VMA write. - VMA. + VMA. The underlying object is reference + counted. :c:member:`!numab_state` CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which mmap read, describes the current state of numab-specific NUMA balancing in relation to this VMA. lock. @@ -287,7 +294,7 @@ typically refer to the leaf level as the .. note:: In instances where the architecture supports fewer page tables than five the kernel cleverly 'folds' page table levels, that is stubbing out functions related to the skipped levels. This allows us to - conceptually act is if there were always five levels, even if the + conceptually act as if there were always five levels, even if the compiler might, in practice, eliminate any code relating to missing ones. @@ -298,15 +305,16 @@ There are four key operations typically establishes this suffices for traversal (there are also lockless variants which eliminate even this requirement, such as :c:func:`!gup_fast`). 2. **Installing** page table mappings - Whether creating a new mapping or - modifying an existing one. This requires that the VMA is kept stable via an - mmap or VMA lock (explicitly not rmap locks). + modifying an existing one in such a way as to change its identity. This + requires that the VMA is kept stable via an mmap or VMA lock (explicitly not + rmap locks). 3. **Zapping/unmapping** page table entries - This is what the kernel calls clearing page table mappings at the leaf level only, whilst leaving all page tables in place. This is a very common operation in the kernel performed on file truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`, and others. This is performed by a number of functions - including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages` - among others. The VMA need only be kept stable for this operation. + including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`. + The VMA need only be kept stable for this operation. 4. **Freeing** page tables - When finally the kernel removes page tables from a userland process (typically via :c:func:`!free_pgtables`) extreme care must be taken to ensure this is done safely, as this logic finally frees all page @@ -314,6 +322,10 @@ There are four key operations typically caller has both zapped the range and prevented any further faults or modifications within it). +.. note:: Modifying mappings for reclaim or migration is performed under rmap + lock as it, like zapping, does not fundamentally modify the identity + of what is being mapped. + **Traversing** and **zapping** ranges can be performed holding any one of the locks described in the terminology section above - that is the mmap lock, the VMA lock or either of the reverse mapping locks. @@ -323,9 +335,9 @@ ahead and perform these operations on pa operations that perform writes also acquire internal page table locks to serialise - see the page table implementation detail section for more details). -When **installing** page table entries, the mmap or VMA lock mut be held to keep -the VMA stable. We explore why this is in the page table locking details -below.
+When **installing** page table entries, the mmap or VMA lock must be held to +keep the VMA stable. We explore why this is in the page table locking details +section below. **Freeing** page tables is an entirely internal memory management operation and has special requirements (see the page freeing section below for more details). @@ -386,50 +398,50 @@ There is also a file-system specific loc .. code-block:: - ->i_mmap_rwsem (truncate_pagecache) - ->private_lock (__free_pte->block_dirty_folio) - ->swap_lock (exclusive_swap_page, others) + ->i_mmap_rwsem (truncate_pagecache) + ->private_lock (__free_pte->block_dirty_folio) + ->swap_lock (exclusive_swap_page, others) ->i_pages lock ->i_rwsem - ->invalidate_lock (acquired by fs in truncate path) - ->i_mmap_rwsem (truncate->unmap_mapping_range) + ->invalidate_lock (acquired by fs in truncate path) + ->i_mmap_rwsem (truncate->unmap_mapping_range) ->mmap_lock ->i_mmap_rwsem ->page_table_lock or pte_lock (various, mainly in memory.c) - ->i_pages lock (arch-dependent flush_dcache_mmap_lock) + ->i_pages lock (arch-dependent flush_dcache_mmap_lock) ->mmap_lock - ->invalidate_lock (filemap_fault) - ->lock_page (filemap_fault, access_process_vm) + ->invalidate_lock (filemap_fault) + ->lock_page (filemap_fault, access_process_vm) - ->i_rwsem (generic_perform_write) - ->mmap_lock (fault_in_readable->do_page_fault) + ->i_rwsem (generic_perform_write) + ->mmap_lock (fault_in_readable->do_page_fault) bdi->wb.list_lock - sb_lock (fs/fs-writeback.c) - ->i_pages lock (__sync_single_inode) + sb_lock (fs/fs-writeback.c) + ->i_pages lock (__sync_single_inode) ->i_mmap_rwsem - ->anon_vma.lock (vma_merge) + ->anon_vma.lock (vma_merge) ->anon_vma.lock ->page_table_lock or pte_lock (anon_vma_prepare and various) ->page_table_lock or pte_lock - ->swap_lock (try_to_unmap_one) - ->private_lock (try_to_unmap_one) - ->i_pages lock (try_to_unmap_one) - ->lruvec->lru_lock (follow_page_mask->mark_page_accessed) - ->lruvec->lru_lock (check_pte_range->folio_isolate_lru) - ->private_lock (folio_remove_rmap_pte->set_page_dirty) - ->i_pages lock (folio_remove_rmap_pte->set_page_dirty) - bdi.wb->list_lock (folio_remove_rmap_pte->set_page_dirty) - ->inode->i_lock (folio_remove_rmap_pte->set_page_dirty) - bdi.wb->list_lock (zap_pte_range->set_page_dirty) - ->inode->i_lock (zap_pte_range->set_page_dirty) - ->private_lock (zap_pte_range->block_dirty_folio) + ->swap_lock (try_to_unmap_one) + ->private_lock (try_to_unmap_one) + ->i_pages lock (try_to_unmap_one) + ->lruvec->lru_lock (follow_page_mask->mark_page_accessed) + ->lruvec->lru_lock (check_pte_range->folio_isolate_lru) + ->private_lock (folio_remove_rmap_pte->set_page_dirty) + ->i_pages lock (folio_remove_rmap_pte->set_page_dirty) + bdi.wb->list_lock (folio_remove_rmap_pte->set_page_dirty) + ->inode->i_lock (folio_remove_rmap_pte->set_page_dirty) + bdi.wb->list_lock (zap_pte_range->set_page_dirty) + ->inode->i_lock (zap_pte_range->set_page_dirty) + ->private_lock (zap_pte_range->block_dirty_folio) Please check the current state of these comments which may have changed since the time of writing of this document. @@ -592,7 +604,7 @@ or zapping). A typical pattern taken when traversing page table entries to install a new mapping is to optimistically determine whether the page table entry in the table above is empty, if so, only then acquiring the page table lock and checking -again to see if it was allocated underneath is. +again to see if it was allocated underneath us. 
This allows for a traversal with page table locks only being taken when required. An example of this is :c:func:`!__pud_alloc`. @@ -603,7 +615,7 @@ eliminated the PMD entry as well as the This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry for the PTE, carefully checking it is as expected, before acquiring the -PTE-specific lock, and then *again* checking that the PMD lock is as expected. +PTE-specific lock, and then *again* checking that the PMD entry is as expected. If a THP collapse (or similar) were to occur then the lock on both pages would be acquired, so we can ensure this is prevented while the PTE lock is held. @@ -654,7 +666,7 @@ page tables). Most notable of these is : moving higher level page tables. In these instances, it is required that **all** locks are taken, that is -the mmap lock, the VMA lock and the relevant rmap lock. +the mmap lock, the VMA lock and the relevant rmap locks. You can observe this in the :c:func:`!mremap` implementation in the functions :c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap @@ -669,11 +681,10 @@ Overview VMA read locking is entirely optimistic - if the lock is contended or a competing write has started, then we do not obtain a read lock. -A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu` function, which -first calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an -RCU critical section, then attempts to VMA lock it via -:c:func:`!vma_start_read`, before releasing the RCU lock via -:c:func:`!rcu_read_unlock`. +A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first +calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU +critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`, +before releasing the RCU lock via :c:func:`!rcu_read_unlock`. VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it _ Patches currently in -mm which might be from lorenzo.stoakes@xxxxxxxxxx are docs-mm-add-vma-locks-documentation.patch docs-mm-add-vma-locks-documentation-v3.patch docs-mm-add-vma-locks-documentation-fix.patch
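As an illustrative aside to the page table discussion in the patch above: the optimistic
"check, lock, re-check" pattern used when installing higher-level page table entries can
be sketched as below. This is a simplified sketch modelled on :c:func:`!__pud_alloc` in
mm/memory.c - the function name here is hypothetical, and the real implementation
additionally issues an smp_wmb() barrier before publishing the new table, so consult the
kernel source for the authoritative version:

.. code-block:: c

   /*
    * Sketch: install a new PUD-level page table. The caller has already
    * locklessly observed that *p4d appears empty; we allocate a new table,
    * then take the page table lock and re-check, since another thread may
    * have populated the entry in the meantime.
    */
   int example_pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
   {
           pud_t *new = pud_alloc_one(mm, address);

           if (!new)
                   return -ENOMEM;

           spin_lock(&mm->page_table_lock);
           if (!p4d_present(*p4d)) {
                   mm_inc_nr_puds(mm);
                   /* The real code issues smp_wmb() here before publishing. */
                   p4d_populate(mm, p4d, new);
           } else {
                   /* Lost the race - another thread installed a table. */
                   pud_free(mm, new);
           }
           spin_unlock(&mm->page_table_lock);
           return 0;
   }

Similarly, the optimistic VMA read locking described in the final hunk is used by
architecture page fault handlers in roughly the following shape - a sketch of the
common pattern with error handling and per-architecture details omitted, and with a
hypothetical fallback label; see e.g. arch/x86/mm/fault.c for a real example:

.. code-block:: c

   struct vm_area_struct *vma;
   vm_fault_t fault;

   /*
    * Look the VMA up in an RCU critical section and optimistically take
    * the VMA read lock. This returns NULL if the lock is contended or a
    * competing write has started, in which case the handler falls back
    * to taking the mmap read lock.
    */
   vma = lock_vma_under_rcu(mm, address);
   if (!vma)
           goto retry_under_mmap_lock;     /* hypothetical fallback label */

   fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
   /* On retry/completion the fault handler has already dropped the VMA lock. */
   if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
           vma_end_read(vma);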