Re: [PATCH] docs/mm: add VMA locks documentation

Jann Horn <jannh@xxxxxxxxxx> · Thu, 7 Nov 2024 22:15:31 +0100

On Thu, Nov 7, 2024 at 8:02 PM Lorenzo Stoakes
<lorenzo.stoakes@xxxxxxxxxx> wrote:
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be used when it comes to interacting with
> mm_struct and vm_area_struct objects.

Thanks, I think this is looking pretty good now.

> +VMA fields
> +^^^^^^^^^^
> +
> +We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
> +easier to explore their locking characteristics:
> +
> +.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
> +         are in effect an internal implementation detail.
> +
> +.. table:: Virtual layout fields
> +
> +   ===================== ======================================== ===========
> +   Field                 Description                              Write lock
> +   ===================== ======================================== ===========
[...]
> +   :c:member:`!vm_pgoff` Describes the page offset into the file, rmap write.
> +                         the original page offset within the      mmap write,
> +                         virtual address space (prior to any      rmap write.
> +                         :c:func:`!mremap`), or PFN if a PFN map
> +                         and the architecture does not support
> +                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.

Is that a typo in the rightmost column? "rmap write. mmap write, rmap
write." lists rmap twice

> +   ===================== ======================================== ===========
> +
> +These fields describes the size, start and end of the VMA, and as such cannot be

s/describes/describe/

> +modified without first being hidden from the reverse mapping since these fields
> +are used to locate VMAs within the reverse mapping interval trees.
[...]
> +.. table:: Reverse mapping fields
> +
> +   =================================== ========================================= ================
> +   Field                               Description                               Write lock
> +   =================================== ========================================= ================
> +   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write,
> +                                       mapping is file-backed, to place the VMA  VMA write,
> +                                       in the                                    i_mmap write.
> +                                       :c:member:`!struct address_space->i_mmap`
> +                                      red/black interval tree.

This list of write locks is correct regarding what locks you need to
make changes to the VMA's tree membership. Technically at a lower
level, the contents of vma->shared.rb are written while holding only
the file rmap lock when the surrounding rbtree nodes change, but
that's because the rbtree basically takes ownership of the node once
it's been inserted into the tree. But I don't know how to concisely
express that here, and it's kind of a nitpicky detail, so I think the
current version looks good.

Maybe you could add "For changing the VMA's association with the
rbtree:" on top of the list of write locks for this one?

> +   :c:member:`!shared.rb_subtree_last` Metadata used for management of the
> +                                       interval tree if the VMA is file-backed.  mmap write,
> +                                                                                 VMA write,
> +                                                                                 i_mmap write.

For this one, I think it would be more accurate to say that it is
written under just the i_mmap write lock. Though maybe that looks
kinda inconsistent with the previous one...

> +   :c:member:`!anon_vma_chain`         List of links to forked/CoW’d anon_vma    mmap read,
> +                                       objects.                                  anon_vma write.

Technically not just forked/CoW'd ones, but also the current one.
Maybe something like "List of links to anon_vma objects (including
inherited ones) that anonymous pages can be associated with"?

> +   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        mmap_read,
> +                                       anonymous folios mapped exclusively to    page_table_lock.
> +                                      this VMA.

move_vma() uses unlink_anon_vmas() to change ->anon_vma from non-NULL
to NULL. There we hold:

 - mmap lock (exclusive, from sys_mremap)
 - VMA write lock (from move_vma)
 - anon_vma lock (from unlink_anon_vmas)

So it's not true that we always hold the page_table_lock for this.

Should this maybe have two separate parts, one for "for changing NULL
-> non-NULL" and one for "changing non-NULL to NULL"? Where the
NULL->non-NULL scenario uses the locks you listed and non-NULL->NULL
relies on write-locking the VMA and the anon_vma?

> +   =================================== ========================================= ================
> +
> +These fields are used to both place the VMA within the reverse mapping, and for
> +anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
> +and the :c:struct:`!struct anon_vma` which folios mapped exclusively to this VMA should

typo: s/which folios/in which folios/

> +reside.
> +
> +Page tables
> +-----------
> +
> +We won't speak exhaustively on the subject but broadly speaking, page tables map
> +virtual addresses to physical ones through a series of page tables, each of
> +which contain entries with physical addresses for the next page table level
> +(along with flags), and at the leaf level the physical addresses of the
> +underlying physical data pages (with offsets into these pages provided by the
> +virtual address itself).
> +
> +In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
> +pages might eliminate one or two of these levels, but when this is the case we
> +typically refer to the leaf level as the PTE level regardless.

(That last sentence doesn't match my headcanon but I also don't have
any reference for what is common Linux kernel phrasing around this so
this isn't really an actionable comment.)

> +.. note:: In instances where the architecture supports fewer page tables than
> +   five the kernel cleverly 'folds' page table levels, that is skips them within
> +   the logic, regardless we can act as if there were always five.
> +
> +There are three key operations typically performed on page tables:
> +
> +1. **Installing** page table mappings - whether creating a new mapping or
> +   modifying an existing one.
> +2. **Zapping/unmapping** page tables - This is what the kernel calls clearing page

bikeshedding, feel free to ignore:
Maybe "Zapping/unmapping page table entries"? At least that's how I
always read "zap_pte_range()" in my head - "zap page table entry
range". Though I don't know if that's the canonical interpretation.

> +   table mappings at the leaf level only, whilst leaving all page tables in
> +   place. This is a very common operation in the kernel performed on file
> +   truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`,
> +   and others. This is performed by a number of functions including
> +   :c:func:`!unmap_mapping_range`, :c:func:`!unmap_mapping_pages` and reverse
> +   mapping logic.
[...]
> +Locking rules
> +^^^^^^^^^^^^^
> +
> +We establish basic locking rules when interacting with page tables:
> +
> +* When changing a page table entry the page table lock for that page table
> +  **must** be held.

(except, as you described below, in free_pgtables() when changing page
table entries pointing to lower-level page tables)

> +* Reads from and writes to page table entries must be appropriately atomic. See
> +  the section on atomicity below.
[...]
> +Page table installation
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
> +acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
> +:c:func:`!__pmd_alloc` respectively.
> +
> +.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
> +   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
> +   references the :c:member:`!mm->page_table_lock`.
> +
> +Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
> +:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD

typo: s/used/use/

> +physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
> +:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
> +:c:func:`!__pte_alloc`.
> +
> +Finally, modifying the contents of the PTE has special treatment, as this is a

nit: unclear what "this" refers to here - it looks like it refers to
"the PTE", but "the PTE is a lock" wouldn't make grammatical sense

> +lock that we must acquire whenever we want stable and exclusive access to
> +entries pointing to data pages within a PTE, especially when we wish to modify
> +them.

I don't think "entries pointing to data pages" need any more locking
than other entries, like swap entries or migration markers?

> +This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
> +ensure that the PTE hasn't changed from under us, ultimately invoking
> +:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
> +the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
> +must be released via :c:func:`!pte_unmap_unlock`.
> +
> +.. note:: There are some variants on this, such as
> +   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
> +   for brevity we do not explore this.  See the comment for
> +   :c:func:`!__pte_offset_map_lock` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).
[...]
> +Page table moving
> +^^^^^^^^^^^^^^^^^
> +
> +Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
> +page tables). Most notable of these is :c:func:`!mremap`, which is capable of
> +moving higher level page tables.
> +
> +In these instances, it is either required that **all** locks are taken, that is
> +the mmap lock, the VMA lock and the relevant rmap lock, or that the mmap lock
> +and VMA locks are taken and some other measure is taken to avoid rmap races (see
> +the comment in :c:func:`!move_ptes` in the :c:func:`!mremap` implementation for
> +details of how this is handled in this instance).

mremap() always takes the rmap locks when moving entire page tables,
and AFAIK that is necessary to avoid races that lead to TLB flushes
going to the wrong address. mremap() sometimes moves *leaf entries*
without holding rmap locks, but never entire tables.

move_pgt_entry() is confusingly written - need_rmap_locks is actually
always true in the NORMAL_* cases that move non-leaf entries.

> +You can observe that in the :c:func:`!mremap` implementation in the functions
> +:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
> +side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
> +
> +VMA lock internals
> +------------------
> +
> +This kind of locking is entirely optimistic - if the lock is contended or a
> +competing write has started, then we do not obtain a read lock.
> +
> +The :c:func:`!lock_vma_under_rcu` function first calls :c:func:`!rcu_read_lock`
> +to ensure that the VMA is acquired in an RCU critical section, then attempts to

Maybe s/is acquired in/is looked up in/, to make it clearer that
you're talking about a VMA lookup?

> +VMA lock it via :c:func:`!vma_start_read`, before releasing the RCU lock via
> +:c:func:`!rcu_read_unlock`.
> +
> +VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> +their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
> +via :c:func:`!vma_end_read`.
> +
> +VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
> +VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
> +acquired. An mmap write lock **must** be held for the duration of the VMA write
> +lock, releasing or downgrading the mmap write lock also releases the VMA write
> +lock so there is no :c:func:`!vma_end_write` function.
> +
> +Note that a semaphore write lock is not held across a VMA lock. Rather, a
> +sequence number is used for serialisation, and the write semaphore is only
> +acquired at the point of write lock to update this.
> +
> +This ensures the semantics we require - VMA write locks provide exclusive write
> +access to the VMA.
> +
> +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> +of the heavily contended mmap lock. It is implemented using a combination of a
> +read/write semaphore and sequence numbers belonging to the containing
> +:c:struct:`!struct mm_struct` and the VMA.
> +
> +Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
> +operation, i.e. it tries to acquire a read lock but returns false if it is
> +unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
> +called to release the VMA read lock. This can be done under RCU alone.

Please clarify what "This" refers to, and whether the part about RCU
is explaining an implementation detail or the API contract.

> +
> +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> +:c:func:`!vma_start_write`, however the write lock is released by the termination or
> +downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.
> +
> +All this is achieved by the use of per-mm and per-VMA sequence counts, which are
> +used in order to reduce complexity, especially for operations which write-lock
> +multiple VMAs at once.
> +
> +If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
> +sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
> +they differ, then they are not.

nit: "it is not"?

> +
> +Each time an mmap write lock is acquired in :c:func:`!mmap_write_lock`,
> +:c:func:`!mmap_write_lock_nested`, :c:func:`!mmap_write_lock_killable`, the
> +:c:member:`!mm->mm_lock_seq` sequence number is incremented via
> +:c:func:`!mm_lock_seqcount_begin`.
> +
> +Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
> +:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
> +also increments :c:member:`!mm->mm_lock_seq` via
> +:c:func:`!mm_lock_seqcount_end`.
> +
> +This way, we ensure regardless of the VMA's sequence number count, that a write
> +lock is not incorrectly indicated (since we increment the sequence counter on
> +acquiring the mmap write lock, which is required in order to obtain a VMA write
> +lock), and that when we release an mmap write lock, we efficiently release
> +**all** VMA write locks contained within the mmap at the same time.

Incrementing on mmap_write_lock() is not necessary for VMA locks; that
part is for future seqlock-style users of the MM sequence count that
want to work without even taking the VMA lock, with the new
mmap_lock_speculation_{begin,end} API. See commit db8f33d3b7698 and
the thread https://lore.kernel.org/linux-mm/20241010205644.3831427-5-andrii@xxxxxxxxxx/
.