The quilt patch titled
     Subject: docs-mm-add-vma-locks-documentation-v3
has been removed from the -mm tree.  Its filename was
     docs-mm-add-vma-locks-documentation-v3.patch

This patch was dropped because it was folded into docs-mm-add-vma-locks-documentation.patch

------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
Subject: docs-mm-add-vma-locks-documentation-v3
Date: Thu, 14 Nov 2024 20:54:01 +0000

Link: https://lkml.kernel.org/r/20241114205402.859737-1-lorenzo.stoakes@xxxxxxxxxx
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>
Acked-by: Mike Rapoport (Microsoft) <rppt@xxxxxxxxxx>
Acked-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> (for page table locks part)
Reviewed-by: Bagas Sanjaya <bagasdotme@xxxxxxxxx>
Reviewed-by: Jann Horn <jannh@xxxxxxxxxx>
Cc: Alice Ryhl <aliceryhl@xxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: Hillf Danton <hdanton@xxxxxxxx>
Cc: Jonathan Corbet <corbet@xxxxxxx>
Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Cc: SeongJae Park <sj@xxxxxxxxxx>
Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
[lorenzo.stoakes@xxxxxxxxxx: docs/mm: minor corrections]
  Link: https://lkml.kernel.org/r/d3de735a-25ae-4eb2-866c-a9624fe6f795@lucifer.local
[jannh@xxxxxxxxxx: docs/mm: add more warnings around page table access]
  Link: https://lkml.kernel.org/r/20241118-vma-docs-addition1-onv3-v2-1-c9d5395b72ee@xxxxxxxxxx
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 Documentation/mm/process_addrs.rst |  161 ++++++++++++++++-----------
 1 file changed, 99 insertions(+), 62 deletions(-)

--- a/Documentation/mm/process_addrs.rst~docs-mm-add-vma-locks-documentation-v3
+++ a/Documentation/mm/process_addrs.rst
@@ -53,7 +53,7 @@ Terminology
   you **must** have already acquired an :c:func:`!mmap_write_lock`.
 * **rmap locks** - When trying to access VMAs through the reverse mapping via a
   :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
-  (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via
+  (reachable from a folio via :c:member:`!folio->mapping`). VMAs must be stabilised via
   :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
   anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
   :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
@@ -68,8 +68,8 @@ described below).
 
 Stabilising a VMA also keeps the address space described by it around.
 
-Using address space locks
--------------------------
+Lock usage
+----------
 
 If you want to **read** VMA metadata fields or just keep the VMA stable, you
 must do one of the following:
@@ -101,6 +101,9 @@ in order to obtain a VMA **write** lock.
   obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire
   then release an RCU lock to lookup the VMA for you).
 
+This constrains the impact of writers on readers, as a writer can interact with
+one VMA while a reader interacts with another simultaneously.
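+
+For illustration only, a sketch of how a page fault handler might use this
+API (a simplified example, not a verbatim excerpt of kernel code; the
+surrounding context - ``mm``, ``address`` - and all error handling are
+assumed or elided):
+
+.. code-block:: c
+
+        struct vm_area_struct *vma;
+
+        /* Optimistically look up and read-lock the VMA under RCU. */
+        vma = lock_vma_under_rcu(mm, address);
+        if (!vma)
+                return VM_FAULT_RETRY; /* contended - fall back to mmap lock */
+
+        /* ... handle the fault; VMA fields are stable here ... */
+
+        vma_end_read(vma); /* drop the VMA read lock */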
+
 .. note:: The primary users of VMA read locks are page fault handlers, which
    means that without a VMA write lock, page faults will run concurrent with
    whatever you are doing.
@@ -209,13 +212,17 @@ These are the core fields which describe
                                :c:struct:`!struct anon_vma_name`         VMA write.
                                object providing a name for anonymous
                                mappings, or :c:macro:`!NULL` if none
-                               is set or the VMA is file-backed.
+                               is set or the VMA is file-backed. The
+                               underlying object is reference counted
+                               and can be shared across multiple VMAs
+                               for scalability.
 :c:member:`!swap_readahead_info`  CONFIG_SWAP  Metadata used by the swap mechanism       mmap read,
                                to perform readahead. This field is        swap-specific
                                accessed atomically.                       lock.
 :c:member:`!vm_policy`  CONFIG_NUMA  :c:type:`!mempolicy` object which        mmap write,
                                describes the NUMA behaviour of the        VMA write.
-                               VMA.
+                               VMA. The underlying object is reference
+                               counted.
 :c:member:`!numab_state`  CONFIG_NUMA_BALANCING  :c:type:`!vma_numab_state` object which  mmap read,
                                describes the current state of             numab-specific
                                NUMA balancing in relation to this VMA.    lock.
@@ -287,26 +294,27 @@ typically refer to the leaf level as the
 .. note:: In instances where the architecture supports fewer page tables than
    five the kernel cleverly 'folds' page table levels, that is stubbing
    out functions related to the skipped levels. This allows us to
-   conceptually act is if there were always five levels, even if the
+   conceptually act as if there were always five levels, even if the
    compiler might, in practice, eliminate any code relating to missing ones.
 
-There are free key operations typically performed on page tables:
+There are four key operations typically performed on page tables:
 
 1. **Traversing** page tables - Simply reading page tables in order to traverse
    them. This only requires that the VMA is kept stable, so a lock which
    establishes this suffices for traversal (there are also lockless variants
    which eliminate even this requirement, such as :c:func:`!gup_fast`).
 2. **Installing** page table mappings - Whether creating a new mapping or
-   modifying an existing one. This requires that the VMA is kept stable via an
-   mmap or VMA lock (explicitly not rmap locks).
+   modifying an existing one in such a way as to change its identity. This
+   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
+   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
    clearing page table mappings at the leaf level only, whilst leaving all page
    tables in place. This is a very common operation in the kernel performed on
    file truncation, the :c:macro:`!MADV_DONTNEED` operation via
    :c:func:`!madvise`, and others. This is performed by a number of functions
-   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`
-   among others. The VMA need only be kept stable for this operation.
+   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
+   The VMA need only be kept stable for this operation.
 4. **Freeing** page tables - When finally the kernel removes page tables from a
    userland process (typically via :c:func:`!free_pgtables`) extreme care must
    be taken to ensure this is done safely, as this logic finally frees all page
@@ -314,6 +322,10 @@ There are free key operations typically
    caller has both zapped the range and prevented any further faults or
    modifications within it).
 
+.. note:: Modifying mappings for reclaim or migration is performed under rmap
+   lock as it, like zapping, does not fundamentally modify the identity
+   of what is being mapped.
+
 **Traversing** and **zapping** ranges can be performed holding any one of the
 locks described in the terminology section above - that is the mmap lock, the
 VMA lock or either of the reverse mapping locks.
@@ -323,9 +335,14 @@ ahead and perform these operations on pa
 operations that perform writes also acquire internal page table locks to
 serialise - see the page table implementation detail section for more details).
 
-When **installing** page table entries, the mmap or VMA lock mut be held to keep
-the VMA stable. We explore why this is in the page table locking details section
-below.
+When **installing** page table entries, the mmap or VMA lock must be held to
+keep the VMA stable. We explore why this is in the page table locking details
+section below.
+
+.. warning:: Page tables are normally only traversed in regions covered by VMAs.
+             If you want to traverse page tables in areas that might not be
+             covered by VMAs, heavier locking is required.
+             See :c:func:`!walk_page_range_novma` for details.
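+
+For illustration only, a deliberately simplified traversal sketch (``mm`` and
+``addr`` are assumed context; a real walker must also check
+pXX_none()/pXX_bad() at each level and handle huge mappings):
+
+.. code-block:: c
+
+        pgd_t *pgd = pgd_offset(mm, addr);
+        p4d_t *p4d = p4d_offset(pgd, addr);
+        pud_t *pud = pud_offset(p4d, addr);
+        pmd_t *pmd = pmd_offset(pud, addr);
+        spinlock_t *ptl;
+        /* Map the PTE table and take its lock; fails if the table is gone. */
+        pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+
+        if (pte) {
+                /* ... read or modify the PTE entry ... */
+                pte_unmap_unlock(pte, ptl);
+        }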
 
 **Freeing** page tables is an entirely internal memory management operation and
 has special requirements (see the page freeing section below for more details).
 
@@ -386,50 +403,50 @@ There is also a file-system specific loc
 
 .. code-block::
 
-  ->i_mmap_rwsem (truncate_pagecache)
-    ->private_lock (__free_pte->block_dirty_folio)
-      ->swap_lock (exclusive_swap_page, others)
+  ->i_mmap_rwsem                (truncate_pagecache)
+    ->private_lock              (__free_pte->block_dirty_folio)
+      ->swap_lock               (exclusive_swap_page, others)
         ->i_pages lock
 
   ->i_rwsem
-    ->invalidate_lock (acquired by fs in truncate path)
-      ->i_mmap_rwsem (truncate->unmap_mapping_range)
+    ->invalidate_lock           (acquired by fs in truncate path)
+      ->i_mmap_rwsem            (truncate->unmap_mapping_range)
 
   ->mmap_lock
     ->i_mmap_rwsem
      ->page_table_lock or pte_lock (various, mainly in memory.c)
-        ->i_pages lock (arch-dependent flush_dcache_mmap_lock)
+        ->i_pages lock          (arch-dependent flush_dcache_mmap_lock)
 
   ->mmap_lock
-    ->invalidate_lock (filemap_fault)
-      ->lock_page (filemap_fault, access_process_vm)
+    ->invalidate_lock           (filemap_fault)
+      ->lock_page               (filemap_fault, access_process_vm)
 
-  ->i_rwsem (generic_perform_write)
-    ->mmap_lock (fault_in_readable->do_page_fault)
+  ->i_rwsem                     (generic_perform_write)
+    ->mmap_lock                 (fault_in_readable->do_page_fault)
 
   bdi->wb.list_lock
-    sb_lock (fs/fs-writeback.c)
-    ->i_pages lock (__sync_single_inode)
+    sb_lock                     (fs/fs-writeback.c)
+    ->i_pages lock              (__sync_single_inode)
 
   ->i_mmap_rwsem
-    ->anon_vma.lock (vma_merge)
+    ->anon_vma.lock             (vma_merge)
 
   ->anon_vma.lock
     ->page_table_lock or pte_lock (anon_vma_prepare and various)
 
   ->page_table_lock or pte_lock
-    ->swap_lock (try_to_unmap_one)
-    ->private_lock (try_to_unmap_one)
-    ->i_pages lock (try_to_unmap_one)
-    ->lruvec->lru_lock (follow_page_mask->mark_page_accessed)
-    ->lruvec->lru_lock (check_pte_range->folio_isolate_lru)
-    ->private_lock (folio_remove_rmap_pte->set_page_dirty)
-    ->i_pages lock (folio_remove_rmap_pte->set_page_dirty)
-    bdi.wb->list_lock (folio_remove_rmap_pte->set_page_dirty)
-    ->inode->i_lock (folio_remove_rmap_pte->set_page_dirty)
-    bdi.wb->list_lock (zap_pte_range->set_page_dirty)
-    ->inode->i_lock (zap_pte_range->set_page_dirty)
-    ->private_lock (zap_pte_range->block_dirty_folio)
+    ->swap_lock                 (try_to_unmap_one)
+    ->private_lock              (try_to_unmap_one)
+    ->i_pages lock              (try_to_unmap_one)
+    ->lruvec->lru_lock          (follow_page_mask->mark_page_accessed)
+    ->lruvec->lru_lock          (check_pte_range->folio_isolate_lru)
+    ->private_lock              (folio_remove_rmap_pte->set_page_dirty)
+    ->i_pages lock              (folio_remove_rmap_pte->set_page_dirty)
+    bdi.wb->list_lock           (folio_remove_rmap_pte->set_page_dirty)
+    ->inode->i_lock             (folio_remove_rmap_pte->set_page_dirty)
+    bdi.wb->list_lock           (zap_pte_range->set_page_dirty)
+    ->inode->i_lock             (zap_pte_range->set_page_dirty)
+    ->private_lock              (zap_pte_range->block_dirty_folio)
 
 Please check the current state of these comments which may have changed since
 the time of writing of this document.
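+
+For illustration only, a sketch of honouring this ordering when both the mmap
+lock and the file rmap lock are required (``mm`` and ``mapping`` are assumed
+context):
+
+.. code-block:: c
+
+        /* mmap_lock nests outside i_mmap_rwsem in the hierarchy above. */
+        mmap_read_lock(mm);
+        i_mmap_lock_read(mapping);
+
+        /* ... walk the file rmap ... */
+
+        i_mmap_unlock_read(mapping);
+        mmap_read_unlock(mm);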
 
@@ -438,6 +455,9 @@ the time of writing of this document.
 Locking Implementation Details
 ------------------------------
 
+.. warning:: Locking rules for PTE-level page tables are very different from
+             locking rules for page tables at other levels.
+
 Page table locking details
 --------------------------
 
@@ -458,8 +478,12 @@ additional locks dedicated to page table
 These locks represent the minimum required to interact with each page table
 level, but there are further requirements.
 
-Importantly, note that on a **traversal** of page tables, no such locks are
-taken. Whether care is taken on reading the page table entries depends on the
+Importantly, note that on a **traversal** of page tables, sometimes no such
+locks are taken. However, at the PTE level, at least concurrent page table
+deletion must be prevented (using RCU) and the page table must be mapped into
+high memory, see below.
+
+Whether care is taken on reading the page table entries depends on the
 architecture, see the section on atomicity below.
 
 Locking rules
 ^^^^^^^^^^^^^
@@ -477,12 +501,6 @@ We establish basic locking rules when in
   the warning below).
 * As mentioned previously, zapping can be performed while simply keeping the VMA
   stable, that is holding any one of the mmap, VMA or rmap locks.
-* Special care is required for PTEs, as on 32-bit architectures these must be
-  mapped into high memory and additionally, careful consideration must be
-  applied to racing with THP, migration or other concurrent kernel operations
-  that might steal the entire PTE table from under us. All this is handled by
-  :c:func:`!pte_offset_map_lock` (see the section on page table installation
-  below for more details).
 
 .. warning:: Populating previously empty entries is dangerous as, when unmapping
              VMAs, :c:func:`!vms_clear_ptes` has a window of time between
@@ -497,8 +515,28 @@ We establish basic locking rules when in
 There are additional rules applicable when moving page tables, which we discuss
 in the section on this topic below.
 
-.. note:: Interestingly, :c:func:`!pte_offset_map_lock` holds an RCU read lock
-          while the PTE page table lock is held.
+PTE-level page tables are different from page tables at other levels, and there
+are extra requirements for accessing them:
+
+* On 32-bit architectures, they may be in high memory (meaning they need to be
+  mapped into kernel memory to be accessible).
+* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
+  rmap lock for reading in combination with the PTE and PMD page table locks.
+  In particular, this happens in :c:func:`!retract_page_tables` when handling
+  :c:macro:`!MADV_COLLAPSE`.
+  So accessing PTE-level page tables requires at least holding an RCU read lock;
+  but that only suffices for readers that can tolerate racing with concurrent
+  page table updates such that an empty PTE is observed (in a page table that
+  has actually already been detached and marked for RCU freeing) while another
+  new page table has been installed in the same location and filled with
+  entries. Writers normally need to take the PTE lock and revalidate that the
+  PMD entry still refers to the same PTE-level page table.
+
+To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
+:c:func:`!pte_offset_map` can be used depending on stability requirements.
+These map the page table into kernel memory if required, take the RCU lock, and
+depending on variant, may also look up or acquire the PTE lock.
+See the comment on :c:func:`!__pte_offset_map_lock`.
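+
+For illustration only, a sketch of the two access patterns (``mm``, ``pmd`` and
+``addr`` are assumed context; the PMD revalidation described above happens
+inside the helpers themselves):
+
+.. code-block:: c
+
+        spinlock_t *ptl;
+        pte_t *pte;
+
+        /* Reader which can tolerate racing with updates: RCU only. */
+        pte = pte_offset_map(pmd, addr);
+        if (pte) {
+                pte_t entry = ptep_get(pte); /* single, careful read */
+                /* ... purely advisory use of 'entry' ... */
+                pte_unmap(pte);
+        }
+
+        /* Writer: the _lock variant additionally acquires the PTE lock. */
+        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+        if (pte) {
+                /* ... modify the entry, e.g. via set_pte_at() ... */
+                pte_unmap_unlock(pte, ptl);
+        } /* else: the table was freed or replaced - rewalk and retry */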
 
 Atomicity
 ^^^^^^^^^
@@ -513,11 +551,11 @@ When performing a page table traversal a
 read must be performed once and only once or not depends on the architecture
 (for instance x86-64 does not require any special precautions).
 
-It is on the write side, or if a read informs whether a write takes place (on an
-installation of a page table entry say, for instance in
-:c:func:`!__pud_install`), where special care must always be taken. In these
-cases we can never assume that page table locks give us entirely exclusive
-access, and must retrieve page table entries once and only once.
+If a write is being performed, or if a read informs whether a write takes place
+(on an installation of a page table entry say, for instance in
+:c:func:`!__pud_install`), special care must always be taken. In these cases we
+can never assume that page table locks give us entirely exclusive access, and
+must retrieve page table entries once and only once.
 
 If we are reading page table entries, then we need only ensure that the compiler
 does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
@@ -592,7 +630,7 @@ or zapping).
 A typical pattern taken when traversing page table entries to install a new
 mapping is to optimistically determine whether the page table entry in the
 table above is empty, if so, only then acquiring the page table lock and checking
-again to see if it was allocated underneath is.
+again to see if it was allocated underneath us.
 
 This allows for a traversal with page table locks only being taken when
 required. An example of this is :c:func:`!__pud_alloc`.
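+
+For illustration only, a sketch of that pattern (``new_pmd_table`` is an
+assumed, pre-allocated PMD table; compare the real logic in
+:c:func:`!__pud_alloc`, which differs in detail):
+
+.. code-block:: c
+
+        if (pud_none(*pud)) {                   /* optimistic, lockless check */
+                spin_lock(&mm->page_table_lock);
+                if (pud_none(*pud))             /* re-check under the lock */
+                        pud_populate(mm, pud, new_pmd_table);
+                spin_unlock(&mm->page_table_lock);
+        }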
 
@@ -603,7 +641,7 @@ eliminated the PMD entry as well as the
 
 This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
 for the PTE, carefully checking it is as expected, before acquiring the
-PTE-specific lock, and then *again* checking that the PMD lock is as expected.
+PTE-specific lock, and then *again* checking that the PMD entry is as expected.
 
 If a THP collapse (or similar) were to occur then the lock on both pages would
 be acquired, so we can ensure this is prevented while the PTE lock is held.
@@ -654,7 +692,7 @@ page tables). Most notable of these is :
 moving higher level page tables.
 
 In these instances, it is required that **all** locks are taken, that is
-the mmap lock, the VMA lock and the relevant rmap lock.
+the mmap lock, the VMA lock and the relevant rmap locks.
 
 You can observe this in the :c:func:`!mremap` implementation in the functions
 :c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
@@ -669,11 +707,10 @@ Overview
 VMA read locking is entirely optimistic - if the lock is contended or a
 competing write has started, then we do not obtain a read lock.
 
-A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu` function, which
-first calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an
-RCU critical section, then attempts to VMA lock it via
-:c:func:`!vma_start_read`, before releasing the RCU lock via
-:c:func:`!rcu_read_unlock`.
+A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
+calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
+critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
+before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
 
 VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
 their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
_

Patches currently in -mm which might be from lorenzo.stoakes@xxxxxxxxxx are

mm-reinstate-ability-to-map-write-sealed-memfd-mappings-read-only.patch
selftests-memfd-add-test-for-mapping-write-sealed-memfd-read-only.patch
mm-correct-typo-in-mmap_state-macro.patch
docs-mm-add-vma-locks-documentation.patch