Re: [PATCH] docs/mm: add VMA locks documentation

Wei Yang <richard.weiyang@xxxxxxxxx> · Mon, 11 Nov 2024 02:42:44 +0000

On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote:
>Andrew - As mm-specific docs were brought under the mm tree in a recent
>change to MAINTAINERS I believe this ought to go through your tree?
>
>
>Locking around VMAs is complicated and confusing. While we have a number of
>disparate comments scattered around the place, we seem to be reaching a
>level of complexity that justifies a serious effort at clearly documenting
>how locks are expected to be used when it comes to interacting with
>mm_struct and vm_area_struct objects.
>
>This is especially pertinent as regards the efforts to find sensible
>abstractions for these fundamental objects in kernel rust code whose
>compiler strictly requires some means of expressing these rules (and
>through this expression, self-document these requirements as well as
>enforce them).
>
>The document limits scope to mmap and VMA locks and those that are
>immediately adjacent and relevant to them - so additionally covers page
>table locking as this is so very closely tied to VMA operations (and relies
>upon us handling these correctly).
>
>The document tries to cover some of the nastier and more confusing edge
>cases and concerns especially around lock ordering and page table teardown.
>
>The document is split between generally useful information for users of mm
>interfaces, and separately a section intended for mm kernel developers
>providing a discussion around internal implementation details.
>
>Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
>---
>
>REVIEWERS NOTES:
>* Apologies if I missed any feedback, I believe I have taken everything
>  into account but do let me know if I missed anything.
>* As before, for convenience, I've uploaded a render of this document to my
>  website at https://ljs.io/output/mm/process_addrs
>* You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`.
>
>v1:
>* Removed RFC tag as I think we are iterating towards something workable
>  and there is interest.
>* Cleaned up and sharpened the language, structure and layout. Separated
>  into top-level details and implementation sections as per Alice.
>* Replaced links with rather more readable formatting.
>* Improved valid mmap/VMA lock state table.
>* Put VMA locks section into the process addresses document as per SJ and
>  Mike.
>* Made clear as to read/write operations against VMA object rather than
>  userland memory, as per Mike's suggestion, also that it does not refer to
>  page tables as per Jann.
>* Moved note into main section as per Mike's suggestion.
>* Fixed grammar mistake as per Mike.
>* Converted list-table to table as per Mike.
>* Corrected various typos as per Jann, Suren.
>* Updated reference to page fault arches as per Jann.
>* Corrected mistaken write lock criteria for vm_lock_seq as per Jann.
>* Updated vm_pgoff description to reference CONFIG_ARCH_HAS_PTE_SPECIAL as
>  per Jann.
>* Updated write lock to mmap read for vma->numab_state as per Jann.
>* Clarified that the write lock is on the mmap and VMA lock at VMA
>  granularity earlier in description as per Suren.
>* Added explicit note at top of VMA lock section to explicitly highlight
>  VMA lock semantics as per Suren.
>* Updated required locking for vma lock fields to N/A to avoid confusion as
>  per Suren.
>* Corrected description of mmap_downgrade() as per Suren.
>* Added a note on gate VMAs as per Jann.
>* Explained that taking mmap read lock under VMA lock is a bad idea due to
>  deadlock as per Jann.
>* Discussed atomicity in page table operations as per Jann.
>* Adapted the well thought out page table locking rules as provided by Jann.
>* Added a comment about pte mapping maintaining an RCU read lock.
>* Added clarification on moving page tables as informed by Jann's comments
>  (though it turns out mremap() doesn't necessarily hold all locks if it
>  can resolve races other ways :)
>* Added Jann's diagram showing lock exclusivity characteristics.
>
>RFC:
>https://lore.kernel.org/all/20241101185033.131880-1-lorenzo.stoakes@xxxxxxxxxx/
>
> Documentation/mm/process_addrs.rst | 678 +++++++++++++++++++++++++++++
> 1 file changed, 678 insertions(+)
>
>diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
>index e8618fbc62c9..a01a7bcf39ff 100644
>--- a/Documentation/mm/process_addrs.rst
>+++ b/Documentation/mm/process_addrs.rst
>@@ -3,3 +3,681 @@
> =================
> Process Addresses
> =================
>+
>+.. toctree::
>+   :maxdepth: 3
>+
>+
>+Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
>+'VMA's of type :c:struct:`!struct vm_area_struct`.
>+
>+Each VMA describes a virtually contiguous memory range with identical
>+attributes, each of which described by a :c:struct:`!struct vm_area_struct`
>+object. Userland access outside of VMAs is invalid except in the case where an
>+adjacent stack VMA could be extended to contain the accessed address.
>+
>+All VMAs are contained within one and only one virtual address space, described
>+by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
>+threads) which share the virtual address space. We refer to this as the
>+:c:struct:`!mm`.
>+
>+Each mm object contains a maple tree data structure which describes all VMAs
>+within the virtual address space.
>+
>+.. note:: An exception to this is the 'gate' VMA which is provided for
>+	  architectures which use :c:struct:`!vsyscall` and is a global static
>+	  object which does not belong to any specific mm.
>+
>+-------
>+Locking
>+-------
>+
>+The kernel is designed to be highly scalable against concurrent read operations
>+on VMA **metadata** so a complicated set of locks are required to ensure memory
>+corruption does not occur.
>+
>+.. note:: Locking VMAs for their metadata does not have any impact on the memory
>+	  they describe or the page tables that map them.
>+
>+Terminology
>+-----------
>+
>+* **mmap locks** - Each MM has a read/write semaphore `mmap_lock` which locks at
>+  a process address space granularity which can be acquired via
>+  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
>+* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
>+  as a read/write semaphore in practice. A VMA read lock is obtained via
>+  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
>+  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
>+  automatically when the mmap write lock is released). To take a VMA write lock
>+  you **must** have already acquired an :c:func:`!mmap_write_lock`.
>+* **rmap locks** - When trying to access VMAs through the reverse mapping via a
>+  :c:struct:`!struct address_space *` or :c:struct:`!struct anon_vma *` object
>+  (each obtainable from a folio), VMAs must be stabilised via
>+  :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
>+  anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
>+  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
>+  locks as the reverse mapping locks, or 'rmap locks' for brevity.
>+
>+We discuss page table locks separately in the dedicated section below.
>+
>+The first thing **any** of these locks achieve is to **stabilise** the VMA
>+within the MM tree. That is, guaranteeing that the VMA object will not be
>+deleted from under you nor modified (except for some specific exceptions
>+describe below).
>+
>+Stabilising a VMA also keeps the address space described by it around.
>+
>+Using address space locks
>+-------------------------
>+
>+If you want to **read** VMA metadata fields or just keep the VMA stable, you
>+must do one of the following:
>+
>+* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
>+  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
>+  you're done with the VMA, *or*
>+* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
>+  acquire the lock atomically so might fail, in which case fall-back logic is
>+  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`,
>+  *or*
>+* Acquire an rmap lock before traversing the locked interval tree (whether
>+  anonymous or file-backed) to obtain the required VMA.
>+
>+If you want to **write** VMA metadata fields, then things vary depending on the
>+field (we explore each VMA field in detail below). For the majority you must:
>+
>+* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
>+  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
>+  you're done with the VMA, *and*
>+* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
>+  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
>+  called.
>+* If you want to be able to write to **any** field, you must also hide the VMA
>+  from the reverse mapping by obtaining an **rmap write lock**.
>+
>+VMA locks are special in that you must obtain an mmap **write** lock **first**
>+in order to obtain a VMA **write** lock. A VMA **read** lock however can be
>+obtained under an RCU lock alone.
>+
>+.. note:: The primary users of VMA read locks are page fault handlers, which
>+	  means that without a VMA write lock, page faults will run concurrent with
>+	  whatever you are doing.
>+
>+Examining all valid lock states:
>+
>+.. table::
>+
>+   ========= ======== ========= ======= ===== =========== ==========
>+   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
>+   ========= ======== ========= ======= ===== =========== ==========
>+   \-        \-       \-        N       N     N           N
>+   \-        R        \-        Y       Y     N           N
>+   \-        \-       R/W       Y       Y     N           N
>+   R/W       \-/R     \-/R/W    Y       Y     N           N
>+   W         W        \-/R      Y       Y     Y           N
>+   W         W        W         Y       Y     Y           Y
>+   ========= ======== ========= ======= ===== =========== ==========
>+
>+.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
>+	     attempting to do the reverse is invalid as it can result in deadlock - if
>+	     another task already holds an mmap write lock and attempts to acquire a VMA
>+	     write lock that will deadlock on the VMA read lock.
>+
>+All of these locks behave as read/write semaphores in practice, so you can
>+obtain either a read or a write lock for both.
>+
>+.. note:: Generally speaking, a read/write semaphore is a class of lock which
>+	  permits concurrent readers. However a write lock can only be obtained
>+	  once all readers have left the critical region (and pending readers
>+	  made to wait).
>+
>+	  This renders read locks on a read/write semaphore concurrent with other
>+	  readers and write locks exclusive against all others holding the semaphore.
>+
>+VMA fields
>+^^^^^^^^^^
>+
>+We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
>+easier to explore their locking characteristics:
>+
>+.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
>+	  are in effect an internal implementation detail.
>+
>+.. table:: Virtual layout fields
>+
>+   ===================== ======================================== ===========
>+   Field                 Description                              Write lock
>+   ===================== ======================================== ===========
>+   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
>+                         VMA describes.                           VMA write,
>+                                                                  rmap write.
>+   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
>+                         VMA describes.                           VMA write,
>+                                                                  rmap write.
>+   :c:member:`!vm_pgoff` Describes the page offset into the file, rmap write.
>+                         the original page offset within the      mmap write,
>+                         virtual address space (prior to any      rmap write.
>+                         :c:func:`!mremap`), or PFN if a PFN map
>+                         and the architecture does not support
>+                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
>+   ===================== ======================================== ===========
>+
>+These fields describes the size, start and end of the VMA, and as such cannot be
>+modified without first being hidden from the reverse mapping since these fields
>+are used to locate VMAs within the reverse mapping interval trees.
>+
>+.. table:: Core fields
>+
>+   ============================ ======================================== =========================
>+   Field                        Description                              Write lock
>+   ============================ ======================================== =========================
>+   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
>+                                                                         initial map.
>+   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
>+                                protection bits determined from VMA
>+                                flags
>+   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
>+                                attributes of the VMA, in union with
>+                                private writable
>+				:c:member:`!__vm_flags`.
>+   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
>+                                field, updated by
>+                                :c:func:`!vm_flags_*` functions.
>+   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
>+                                struct file object describing the        initial map.
>+                                underlying file, if anonymous then
>+				:c:macro:`!NULL`.
>+   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
>+                                the driver or file-system provides a     initial map by
>+                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
>+				object describing callbacks to be
>+                                invoked on VMA lifetime events.
>+   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
>+                                driver-specific metadata.
>+   ============================ ======================================== =========================
>+
>+These are the core fields which describe the MM the VMA belongs to and its attributes.
>+
>+.. table:: Config-specific fields
>+
>+   ================================= ===================== ======================================== ===============
>+   Field                             Configuration option  Description                              Write lock
>+   ================================= ===================== ======================================== ===============
>+   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
>+                                                           :c:struct:`!struct anon_vma_name`        VMA write.
>+                                                           object providing a name for anonymous
>+                                                           mappings, or :c:macro:`!NULL` if none
>+							   is set or the VMA is file-backed.
>+   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read.
>+                                                           to perform readahead.
>+   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
>+                                                           describes the NUMA behaviour of the      VMA write.
>+							   VMA.
>+   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read.
>+                                                           describes the current state of
>+                                                           NUMA balancing in relation to this VMA.
>+                                                           Updated under mmap read lock by
>+							   :c:func:`!task_numa_work`.
>+   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
>+                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
>+                                                           either of zero size if userfaultfd is
>+                                                           disabled, or containing a pointer
>+                                                           to an underlying
>+							   :c:type:`!userfaultfd_ctx` object which
>+                                                           describes userfaultfd metadata.
>+   ================================= ===================== ======================================== ===============
>+
>+These fields are present or not depending on whether the relevant kernel
>+configuration option is set.
>+
>+.. table:: Reverse mapping fields
>+
>+   =================================== ========================================= ================
>+   Field                               Description                               Write lock
>+   =================================== ========================================= ================
>+   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write,
>+                                       mapping is file-backed, to place the VMA  VMA write,
>+                                       in the                                    i_mmap write.
>+                                       :c:member:`!struct address_space->i_mmap`
>+				       red/black interval tree.
>+   :c:member:`!shared.rb_subtree_last` Metadata used for management of the
>+                                       interval tree if the VMA is file-backed.  mmap write,
>+                                                                                 VMA write,
>+                                                                                 i_mmap write.
>+   :c:member:`!anon_vma_chain`         List of links to forked/CoW’d anon_vma    mmap read,
>+                                       objects.                                  anon_vma write.
>+   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        mmap_read,
>+                                       anonymous folios mapped exclusively to    page_table_lock.
>+				       this VMA.
>+   =================================== ========================================= ================
>+
>+These fields are used to both place the VMA within the reverse mapping, and for
>+anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
>+and the :c:struct:`!struct anon_vma` which folios mapped exclusively to this VMA should
>+reside.
>+
>+Page tables
>+-----------
>+
>+We won't speak exhaustively on the subject but broadly speaking, page tables map
>+virtual addresses to physical ones through a series of page tables, each of
>+which contain entries with physical addresses for the next page table level
>+(along with flags), and at the leaf level the physical addresses of the
>+underlying physical data pages (with offsets into these pages provided by the
>+virtual address itself).
>+
>+In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
>+pages might eliminate one or two of these levels, but when this is the case we
>+typically refer to the leaf level as the PTE level regardless.
>+
>+.. note:: In instances where the architecture supports fewer page tables than
>+   five the kernel cleverly 'folds' page table levels, that is skips them within
>+   the logic, regardless we can act as if there were always five.
>+
>+There are three key operations typically performed on page tables:
>+
>+1. **Installing** page table mappings - whether creating a new mapping or
>+   modifying an existing one.
>+2. **Zapping/unmapping** page tables - This is what the kernel calls clearing page
>+   table mappings at the leaf level only, whilst leaving all page tables in
>+   place. This is a very common operation in the kernel performed on file
>+   truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`,
>+   and others. This is performed by a number of functions including
>+   :c:func:`!unmap_mapping_range`, :c:func:`!unmap_mapping_pages` and reverse
>+   mapping logic.
>+3. **Freeing** page tables - When finally the kernel removes page tables from a
>+   userland process (typically via :c:func:`!free_pgtables`) extreme care must
>+   be taken to ensure this is done safely, as this logic finally frees all page
>+   tables in the specified range, taking no care whatsoever with existing
>+   mappings (it assumes the caller has both zapped the range and prevented any
>+   further faults within it).
>+
>+For most kernel developers, cases 1 and 3 are transparent memory management
>+implementation details that are handled behind the scenes for you (we explore
>+these details below in the implementation section).
>+
>+When **zapping** ranges, this can be done holding any one of the locks described
>+in the terminology section above - that is the mmap lock, the VMA lock or either
>+of the reverse mapping locks.
>+
>+That is - as long as you keep the relevant VMA **stable**, you are good to go
>+ahead and zap memory in that VMA's range.
>+
>+.. warning:: When **freeing** page tables, it must not be possible for VMAs
>+	     containing the ranges those page tables map to be accessible via
>+	     the reverse mapping.
>+
>+	     The :c:func:`!free_pgtables` function removes the relevant VMAs
>+	     from the reverse mappings, but no other VMAs can be permitted to be
>+	     accessible and span the specified range.
>+
>+Lock ordering
>+-------------
>+
>+As we have multiple locks across the kernel which may or may not be taken at the
>+same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
>+the **order** in which locks are acquired and released becomes very important.
>+
>+.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
>+   but in doing so inadvertently cause a mutual deadlock.
>+
>+   For example, consider thread 1 which holds lock A and tries to acquire lock B,
>+   while thread 2 holds lock B and tries to acquire lock A.
>+
>+   Both threads are now deadlocked on each other. However, had they attempted to
>+   acquire locks in the same order, one would have waited for the other to
>+   complete its work and no deadlock would have occurred.
>+
>+The opening comment in `mm/rmap.c` describes in detail the required ordering of
>+locks within memory management code:
>+
>+.. code-block::
>+
>+  inode->i_rwsem	(while writing or truncating, not reading or faulting)
>+    mm->mmap_lock
>+      mapping->invalidate_lock (in filemap_fault)
>+        folio_lock
>+          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
>+            vma_start_write
>+              mapping->i_mmap_rwsem
>+                anon_vma->rwsem
>+                  mm->page_table_lock or pte_lock
>+                    swap_lock (in swap_duplicate, swap_info_get)
>+                      mmlist_lock (in mmput, drain_mmlist and others)
>+                      mapping->private_lock (in block_dirty_folio)
>+                          i_pages lock (widely used)
>+                            lruvec->lru_lock (in folio_lruvec_lock_irq)
>+                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
>+                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
>+                        sb_lock (within inode_lock in fs/fs-writeback.c)
>+                        i_pages lock (widely used, in set_page_dirty,
>+                                  in arch-dependent flush_dcache_mmap_lock,
>+                                  within bdi.wb->list_lock in __sync_single_inode)
>+
>+Please check the current state of this comment which may have changed since the
>+time of writing of this document.
>+
>+------------------------------
>+Locking Implementation Details
>+------------------------------
>+
>+Page table locking details
>+--------------------------
>+
>+In addition to the locks described in the terminology section above, we have
>+additional locks dedicated to page tables:
>+
>+* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
>+  and PUD each make use of the process address space granularity
>+  :c:member:`!mm->page_table_lock` lock when modified.
>+
>+* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
>+  either kept within the folios describing the page tables or allocated
>+  separated and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
>+  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
>+  mapped into higher memory (if a 32-bit system) and carefully locked via
>+  :c:func:`!pte_offset_map_lock`.
>+
>+These locks represent the minimum required to interact with each page table
>+level, but there are further requirements.
>+
>+Locking rules
>+^^^^^^^^^^^^^
>+
>+We establish basic locking rules when interacting with page tables:
>+
>+* When changing a page table entry the page table lock for that page table
>+  **must** be held.
>+* Reads from and writes to page table entries must be appropriately atomic. See
>+  the section on atomicity below.
>+* Populating previously empty entries requires that the mmap or VMA locks are
>+  held, doing so with only rmap locks would risk a race with unmapping logic
>+  invoking :c:func:`!unmap_vmas`, so is forbidden.
>+* As mentioned above, zapping can be performed while simply keeping the VMA
>+  stable, that is holding any one of the mmap, VMA or rmap locks.
>+* Special care is required for PTEs, as on 32-bit architectures these must be
>+  mapped into high memory and additionally, careful consideration must be
>+  applied to racing with THP, migration or other concurrent kernel operations
>+  that might steal the entire PTE table from under us. All this is handled by
>+  :c:func:`!pte_offset_map_lock`.
>+
>+There are additional rules applicable when moving page tables, which we discuss
>+in the section on this topic below.
>+
>+.. note:: Interestingly, :c:func:`!pte_offset_map_lock` also maintains an RCU
>+          read lock over the mapping (and therefore combined mapping and
>+          locking) operation.
>+
>+Atomicity
>+^^^^^^^^^
>+
>+Page table entries must always be retrieved once and only once before being
>+interacted with, as we are operating concurrently with other operations and the
>+hardware.
>+
>+Regardless of page table locks, the MMU hardware will update accessed and dirty
>+bits (and in some architectures, perhaps more), and kernel functionality like
>+GUP-fast locklessly traverses page tables, so we cannot safely assume that page
>+table locks give us exclusive access.
>+
>+If we hold page table locks and are reading page table entries, then we need
>+only ensure that the compiler does not rearrange our loads. This is achieved via
>+:c:func:`!pXXp_get` functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`,
>+:c:func:`!pudp_get`, :c:func:`!pmdp_get`, and :c:func:`!ptep_get`.
>+
>+Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
>+the page table entry only once.
>+
>+However, if we wish to manipulate an existing page table entry and care about
>+the previously stored data, we must go further and use an hardware atomic
>+operation as, for example, in :c:func:`!ptep_get_and_clear`.
>+
>+Equally, operations that do not rely on the page table locks, such as GUP-fast
>+(for instance see :c:func:`!gup_fast` and its various page table level handlers
>+like :c:func:`!gup_fast_pte_range`), must very carefully interact with page
>+table entries, using functions such as :c:func:`!ptep_get_lockless` and
>+equivalent for higher page table levels.
>+
>+Writes to page table entries must also be appropriately atomic, as established
>+by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
>+:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.
>+
>+
>+Page table installation
>+^^^^^^^^^^^^^^^^^^^^^^^
>+
>+When allocating a P4D, PUD or PMD and setting the relevant entry in the above
>+PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
>+acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
>+:c:func:`!__pmd_alloc` respectively.
>+
>+.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
>+   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
>+   references the :c:member:`!mm->page_table_lock`.
>+
>+Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
>+:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD
>+physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
>+:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
>+:c:func:`!__pte_alloc`.
>+
>+Finally, modifying the contents of the PTE has special treatment, as this is a
>+lock that we must acquire whenever we want stable and exclusive access to
>+entries pointing to data pages within a PTE, especially when we wish to modify
>+them.
>+
>+This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
>+ensure that the PTE hasn't changed from under us, ultimately invoking
>+:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
>+the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
>+must be released via :c:func:`!pte_unmap_unlock`.
>+
>+.. note:: There are some variants on this, such as
>+   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
>+   for brevity we do not explore this.  See the comment for
>+   :c:func:`!__pte_offset_map_lock` for more details.
>+
>+When modifying data in ranges we typically only wish to allocate higher page
>+tables as necessary, using these locks to avoid races or overwriting anything,
>+and set/clear data at the PTE level as required (for instance when page faulting
>+or zapping).
>+
>+Page table freeing
>+^^^^^^^^^^^^^^^^^^
>+
>+Tearing down page tables themselves is something that requires significant
>+care. There must be no way that page tables designated for removal can be
>+traversed or referenced by concurrent tasks.
>+
>+It is insufficient to simply hold an mmap write lock and VMA lock (which will
>+prevent racing faults, and rmap operations), as a file-backed mapping can be
>+truncated under the :c:struct:`!struct address_space` i_mmap_lock alone.
>+
>+As a result, no VMA which can be accessed via the reverse mapping (either
>+anon_vma or the :c:member:`!struct address_space->i_mmap` interval tree) can
>+have its page tables torn down.
>+
>+The operation is typically performed via :c:func:`!free_pgtables`, which assumes
>+either the mmap write lock has been taken (as specified by its
>+:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.
>+
>+It carefully removes the VMA from all reverse mappings, however it's important
>+that no new ones overlap these or any route remain to permit access to addresses
>+within the range whose page tables are being torn down.
>+
>+As a result of these careful conditions, note that page table entries are
>+cleared without page table locks, as it is assumed that all of these precautions
>+have already been taken (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
>+:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions - note that at this
>+stage it is assumed that PTE entries have been zapped).
>+
>+.. note:: It is possible for leaf page tables to be torn down, independent of
>+          the page tables above it, as is done by
>+          :c:func:`!retract_page_tables`, which is performed under the i_mmap
>+          read lock, PMD, and PTE page table locks, without this level of care.
>+
>+Page table moving
>+^^^^^^^^^^^^^^^^^
>+
>+Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
>+page tables). Most notable of these is :c:func:`!mremap`, which is capable of
>+moving higher level page tables.
>+
>+In these instances, it is either required that **all** locks are taken, that is
>+the mmap lock, the VMA lock and the relevant rmap lock, or that the mmap lock
>+and VMA locks are taken and some other measure is taken to avoid rmap races (see
>+the comment in :c:func:`!move_ptes` in the :c:func:`!mremap` implementation for
>+details of how this is handled in this instance).
>+
>+You can observe that in the :c:func:`!mremap` implementation in the functions
>+:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
>+side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
>+
>+VMA lock internals
>+------------------
>+
>+This kind of locking is entirely optimistic - if the lock is contended or a
>+competing write has started, then we do not obtain a read lock.
>+
>+The :c:func:`!lock_vma_under_rcu` function first calls :c:func:`!rcu_read_lock`
>+to ensure that the VMA is acquired in an RCU critical section, then attempts to
>+VMA lock it via :c:func:`!vma_start_read`, before releasing the RCU lock via
>+:c:func:`!rcu_read_unlock`.
>+
>+VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
>+their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
>+via :c:func:`!vma_end_read`.
>+
>+VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
>+VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
>+acquired. An mmap write lock **must** be held for the duration of the VMA write
>+lock, releasing or downgrading the mmap write lock also releases the VMA write
>+lock so there is no :c:func:`!vma_end_write` function.
>+
>+Note that a semaphore write lock is not held across a VMA lock. Rather, a
>+sequence number is used for serialisation, and the write semaphore is only
>+acquired at the point of write lock to update this.
>+
>+This ensures the semantics we require - VMA write locks provide exclusive write
>+access to the VMA.
>+
>+The VMA lock mechanism is designed to be a lightweight means of avoiding the use
>+of the heavily contended mmap lock. It is implemented using a combination of a
>+read/write semaphore and sequence numbers belonging to the containing
>+:c:struct:`!struct mm_struct` and the VMA.
>+
>+Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
>+operation, i.e. it tries to acquire a read lock but returns false if it is
>+unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
>+called to release the VMA read lock. This can be done under RCU alone.
>+
>+Writing requires the mmap to be write-locked and the VMA lock to be acquired via
>+:c:func:`!vma_start_write`, however the write lock is released by the termination or
>+downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.
>+
>+All this is achieved by the use of per-mm and per-VMA sequence counts, which are
>+used in order to reduce complexity, especially for operations which write-lock
>+multiple VMAs at once.
>+

Hi, Lorenzo

I got a question more than a comment when trying to understand the mechanism.

I am thinking about the benefit of PER_VMA_LOCK.

For write, we always need to grab mmap_lock w/o PER_VMA_LOCK.
For read, seems we don't need to grab mmap_lock with PER_VMA_LOCK. So read
operation benefit the most with PER_VMA_LOCK, right?

>+If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
>+sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
>+they differ, then they are not.
>+
>+Each time an mmap write lock is acquired in :c:func:`!mmap_write_lock`,
>+:c:func:`!mmap_write_lock_nested`, :c:func:`!mmap_write_lock_killable`, the
>+:c:member:`!mm->mm_lock_seq` sequence number is incremented via
>+:c:func:`!mm_lock_seqcount_begin`.
>+
>+Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
>+:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
>+also increments :c:member:`!mm->mm_lock_seq` via
>+:c:func:`!mm_lock_seqcount_end`.
>+
>+This way, we ensure regardless of the VMA's sequence number count, that a write
>+lock is not incorrectly indicated (since we increment the sequence counter on
>+acquiring the mmap write lock, which is required in order to obtain a VMA write
>+lock), and that when we release an mmap write lock, we efficiently release
>+**all** VMA write locks contained within the mmap at the same time.
>+
>+The exclusivity of the mmap write lock ensures this is what we want, as there
>+would never be a reason to persist per-VMA write locks across multiple mmap
>+write lock acquisitions.
>+
>+Each time a VMA read lock is acquired, we acquire a read lock on the
>+:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
>+the sequence count of the VMA does not match that of the mm.
>+
>+If it does, the read lock fails. If it does not, we hold the lock, excluding
>+writers, but permitting other readers, who will also obtain this lock under RCU.
>+
>+Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
>+are also RCU safe, so the whole read lock operation is guaranteed to function
>+correctly.
>+
>+On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
>+read/write semaphore, before setting the VMA's sequence number under this lock,
>+also simultaneously holding the mmap write lock.
>+
>+This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
>+until these are finished and mutual exclusion is achieved.
>+
>+After setting the VMA's sequence number, the lock is released, avoiding
>+complexity with a long-term held write lock.
>+
>+This clever combination of a read/write semaphore and sequence count allows for
>+fast RCU-based per-VMA lock acquisition (especially on page fault, though
>+utilised elsewhere) with minimal complexity around lock ordering.
>+
>+mmap write lock downgrading
>+---------------------------
>+
>+When an mmap write lock is held, one has exclusive access to resources within
>+the mmap (with the usual caveats about requiring VMA write locks to avoid races
>+with tasks holding VMA read locks).
>+
>+It is then possible to **downgrade** from a write lock to a read lock via
>+:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
>+implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
>+importantly does not relinquish the mmap lock while downgrading, therefore
>+keeping the locked virtual address space stable.
>+
>+An interesting consequence of this is that downgraded locks will be exclusive
>+against any other task possessing a downgraded lock (since they'd have to
>+acquire a write lock first to do so, and the lock now being a read lock prevents
>+this).
>+
>+For clarity, mapping read (R)/downgraded write (D)/write (W) locks against one
>+another showing which locks exclude the others:
>+
>+.. list-table:: Lock exclusivity
>+   :widths: 5 5 5 5
>+   :header-rows: 1
>+   :stub-columns: 1
>+
>+   * -
>+     - R
>+     - D
>+     - W
>+   * - R
>+     - N
>+     - N
>+     - Y
>+   * - D
>+     - N
>+     - Y
>+     - Y
>+   * - W
>+     - Y
>+     - Y
>+     - Y
>+
>+Here a Y indicates the locks in the matching row/column exclude one another, and
>+N indicates that they do not.
>+
>+Stack expansion
>+---------------
>+
>+Stack expansion throws up additional complexities in that we cannot permit there
>+to be racing page faults, as a result we invoke :c:func:`!vma_start_write` to
>+prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
>--
>2.47.0

-- 
Wei Yang
Help you, Help me