On Thu, Nov 07, 2024 at 10:15:31PM +0100, Jann Horn wrote:
> On Thu, Nov 7, 2024 at 8:02 PM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > Locking around VMAs is complicated and confusing. While we have a number of
> > disparate comments scattered around the place, we seem to be reaching a
> > level of complexity that justifies a serious effort at clearly documenting
> > how locks are expected to be used when it comes to interacting with
> > mm_struct and vm_area_struct objects.
>
> Thanks, I think this is looking pretty good now.

Thanks! I think we're iterating towards the final product bit by bit.

> > +VMA fields
> > +^^^^^^^^^^
> > +
> > +We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
> > +easier to explore their locking characteristics:
> > +
> > +.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
> > +   are in effect an internal implementation detail.
> > +
> > +.. table:: Virtual layout fields
> > +
> > +   ===================== ======================================== ===========
> > +   Field                 Description                              Write lock
> > +   ===================== ======================================== ===========
> [...]
> > +   :c:member:`!vm_pgoff` Describes the page offset into the file, rmap write.
> > +                         the original page offset within the      mmap write,
> > +                         virtual address space (prior to any      rmap write.
> > +                         :c:func:`!mremap`), or PFN if a PFN map
> > +                         and the architecture does not support
> > +                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
>
> Is that a typo in the rightmost column? "rmap write. mmap write, rmap
> write." lists rmap twice

Yep, I noticed this very shortly after sending the patch, have fixed for v2.
>
> > +   ===================== ======================================== ===========
> > +
> > +These fields describes the size, start and end of the VMA, and as such cannot be
>
> s/describes/describe/
>
> > +modified without first being hidden from the reverse mapping since these fields
> > +are used to locate VMAs within the reverse mapping interval trees.
> [...]
> > +.. table:: Reverse mapping fields
> > +
> > +   =================================== ========================================= ================
> > +   Field                               Description                               Write lock
> > +   =================================== ========================================= ================
> > +   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write,
> > +                                       mapping is file-backed, to place the VMA  VMA write,
> > +                                       in the                                    i_mmap write.
> > +                                       :c:member:`!struct address_space->i_mmap`
> > +                                       red/black interval tree.
>
> This list of write locks is correct regarding what locks you need to
> make changes to the VMA's tree membership. Technically at a lower
> level, the contents of vma->shared.rb are written while holding only
> the file rmap lock when the surrounding rbtree nodes change, but
> that's because the rbtree basically takes ownership of the node once
> it's been inserted into the tree. But I don't know how to concisely
> express that here, and it's kind of a nitpicky detail, so I think the
> current version looks good.

Yeah, I think we have to limit how far we go to keep this readable :)

> Maybe you could add "For changing the VMA's association with the
> rbtree:" on top of the list of write locks for this one?
>
> > +   :c:member:`!shared.rb_subtree_last` Metadata used for management of the
> > +                                       interval tree if the VMA is file-backed.  mmap write,
> > +                                                                                 VMA write,
> > +                                                                                 i_mmap write.
>
> For this one, I think it would be more accurate to say that it is
> written under just the i_mmap write lock. Though maybe that looks
> kinda inconsistent with the previous one...
But I think you probably have to hold the others to make the change, so I
think we are good to hand-wave a bit here.

There's an argument for not even listing these fields at all, since they are
an implementation detail (as I decided to do with the VMA lock fields), but we
keep them to be consistent with anon_vma, which we really do _have_ to talk
about.

>
> > +   :c:member:`!anon_vma_chain`         List of links to forked/CoW’d anon_vma    mmap read,
> > +                                       objects.                                  anon_vma write.
>
> Technically not just forked/CoW'd ones, but also the current one.
> Maybe something like "List of links to anon_vma objects (including
> inherited ones) that anonymous pages can be associated with"?

Yeah, true. I thought it was important to emphasise this as you have the one
you will use for exclusively mapped folios in ->anon_vma, but we should
mention this, so I will update to be more accurate.

I don't like 'related anon_vmas' or the vagueness of 'associated with', so I
think it's better to say 'forked/CoW'd anon_vma objects and vma->anon_vma if
non-NULL'.

The whole topic of anon_vmas is a fraught mess (hey, I did some nice diagrams
describing it in the book though :P) so I want to be as specific as I can
here.

>
> > +   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        mmap_read,
> > +                                       anonymous folios mapped exclusively to    page_table_lock.
> > +                                       this VMA.
>
> move_vma() uses unlink_anon_vmas() to change ->anon_vma from non-NULL
> to NULL. There we hold:
>
> - mmap lock (exclusive, from sys_mremap)
> - VMA write lock (from move_vma)
> - anon_vma lock (from unlink_anon_vmas)
>
> So it's not true that we always hold the page_table_lock for this.
>
> Should this maybe have two separate parts, one for "for changing NULL
> -> non-NULL" and one for "changing non-NULL to NULL"? Where the
> NULL->non-NULL scenario uses the locks you listed and non-NULL->NULL
> relies on write-locking the VMA and the anon_vma?
Yeah, there are some annoying inconsistencies. I sort of meant this as 'at
minimum', thinking the anon_vma lock might be implicit once set, but that's
silly; you're right, I will be explicit here and update as you say.

Have added some more details on anon_vma_prepare() etc. here too.

>
> > +   =================================== ========================================= ================
> > +
> > +These fields are used to both place the VMA within the reverse mapping, and for
> > +anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
> > +and the :c:struct:`!struct anon_vma` which folios mapped exclusively to this VMA should
>
> typo: s/which folios/in which folios/

Ack, fixed.

>
> > +reside.
> > +
> > +Page tables
> > +-----------
> > +
> > +We won't speak exhaustively on the subject but broadly speaking, page tables map
> > +virtual addresses to physical ones through a series of page tables, each of
> > +which contain entries with physical addresses for the next page table level
> > +(along with flags), and at the leaf level the physical addresses of the
> > +underlying physical data pages (with offsets into these pages provided by the
> > +virtual address itself).
> > +
> > +In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
> > +pages might eliminate one or two of these levels, but when this is the case we
> > +typically refer to the leaf level as the PTE level regardless.
>
> (That last sentence doesn't match my headcanon but I also don't have
> any reference for what is common Linux kernel phrasing around this so
> this isn't really an actionable comment.)

Does it match your head directory/entry? ;)

>
> > +.. note:: In instances where the architecture supports fewer page tables than
> > +   five the kernel cleverly 'folds' page table levels, that is skips them within
> > +   the logic, regardless we can act as if there were always five.
> > +
> > +There are three key operations typically performed on page tables:
> > +
> > +1. **Installing** page table mappings - whether creating a new mapping or
> > +   modifying an existing one.
> > +2. **Zapping/unmapping** page tables - This is what the kernel calls clearing page
>
> bikeshedding, feel free to ignore:
> Maybe "Zapping/unmapping page table entries"? At least that's how I
> always read "zap_pte_range()" in my head - "zap page table entry
> range". Though I don't know if that's the canonical interpretation.

Yeah, that's a good idea actually, will update.

>
> > +   table mappings at the leaf level only, whilst leaving all page tables in
> > +   place. This is a very common operation in the kernel performed on file
> > +   truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`,
> > +   and others. This is performed by a number of functions including
> > +   :c:func:`!unmap_mapping_range`, :c:func:`!unmap_mapping_pages` and reverse
> > +   mapping logic.
> [...]
> > +Locking rules
> > +^^^^^^^^^^^^^
> > +
> > +We establish basic locking rules when interacting with page tables:
> > +
> > +* When changing a page table entry the page table lock for that page table
> > +  **must** be held.
>
> (except, as you described below, in free_pgtables() when changing page
> table entries pointing to lower-level page tables)

Ack, will spell that out.

>
> > +* Reads from and writes to page table entries must be appropriately atomic. See
> > +  the section on atomicity below.
> [...]
> > +Page table installation
> > +^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > +PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
> > +acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
> > +:c:func:`!__pmd_alloc` respectively.
> > +
> > +.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
> > +   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
> > +   references the :c:member:`!mm->page_table_lock`.
> > +
> > +Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
> > +:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD
>
> typo: s/used/use/

Ack, but probably better as s/used// here I think :>)

Fixed.

>
> > +physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
> > +:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
> > +:c:func:`!__pte_alloc`.
> > +
> > +Finally, modifying the contents of the PTE has special treatment, as this is a
>
> nit: unclear what "this" refers to here - it looks like it refers to
> "the PTE", but "the PTE is a lock" wouldn't make grammatical sense

The PTE lock; have fixed.

>
> > +lock that we must acquire whenever we want stable and exclusive access to
> > +entries pointing to data pages within a PTE, especially when we wish to modify
> > +them.
>
> I don't think "entries pointing to data pages" need any more locking
> than other entries, like swap entries or migration markers?

Ack, updated.

>
> > +This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
> > +ensure that the PTE hasn't changed from under us, ultimately invoking
> > +:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
> > +the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
> > +must be released via :c:func:`!pte_unmap_unlock`.
> > +
> > +.. note:: There are some variants on this, such as
> > +   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
> > +   for brevity we do not explore this. See the comment for
> > +   :c:func:`!__pte_offset_map_lock` for more details.
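For what it's worth, the calling pattern being documented might be worth
showing inline in the doc, something like this kernel-context pseudocode
sketch (not compilable standalone; error/retry handling elided):

```c
spinlock_t *ptl;
pte_t *pte;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
	return; /* The PMD entry changed under us (e.g. concurrent collapse) - retry or bail. */

/* ... examine or modify PTE entries while holding the PTE-level lock ... */

pte_unmap_unlock(pte, ptl);
```

Just a thought - a concrete lock/use/unlock shape may land better than prose
alone.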
> > +
> > +When modifying data in ranges we typically only wish to allocate higher page
> > +tables as necessary, using these locks to avoid races or overwriting anything,
> > +and set/clear data at the PTE level as required (for instance when page faulting
> > +or zapping).
> [...]
> > +Page table moving
> > +^^^^^^^^^^^^^^^^^
> > +
> > +Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
> > +page tables). Most notable of these is :c:func:`!mremap`, which is capable of
> > +moving higher level page tables.
> > +
> > +In these instances, it is either required that **all** locks are taken, that is
> > +the mmap lock, the VMA lock and the relevant rmap lock, or that the mmap lock
> > +and VMA locks are taken and some other measure is taken to avoid rmap races (see
> > +the comment in :c:func:`!move_ptes` in the :c:func:`!mremap` implementation for
> > +details of how this is handled in this instance).
>
> mremap() always takes the rmap locks when moving entire page tables,
> and AFAIK that is necessary to avoid races that lead to TLB flushes
> going to the wrong address. mremap() sometimes moves *leaf entries*
> without holding rmap locks, but never entire tables.
>
> move_pgt_entry() is confusingly written - need_rmap_locks is actually
> always true in the NORMAL_* cases that move non-leaf entries.

OK, you're right, this NORMAL_ vs HPAGE_ thing... ugh, mremap() needs a TOTAL
REFACTOR. Another bit of churn for Lorenzo 'churn king' Stoakes to get to at
some point :P

Have updated.

I eventually want to put as much as I possibly can into mm/vma.c so we can
make as many things userland unit testable as possible. But that's in the
future :)

>
> > +You can observe that in the :c:func:`!mremap` implementation in the functions
> > +:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
> > +side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
> > +
> > +VMA lock internals
> > +------------------
> > +
> > +This kind of locking is entirely optimistic - if the lock is contended or a
> > +competing write has started, then we do not obtain a read lock.
> > +
> > +The :c:func:`!lock_vma_under_rcu` function first calls :c:func:`!rcu_read_lock`
> > +to ensure that the VMA is acquired in an RCU critical section, then attempts to
>
> Maybe s/is acquired in/is looked up in/, to make it clearer that
> you're talking about a VMA lookup?

Ack, this is why the maple tree is so critical here, as it's 'RCU-friendly'.
Have updated.

>
> > +VMA lock it via :c:func:`!vma_start_read`, before releasing the RCU lock via
> > +:c:func:`!rcu_read_unlock`.
> > +
> > +VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> > +their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
> > +via :c:func:`!vma_end_read`.
> > +
> > +VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
> > +VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
> > +acquired. An mmap write lock **must** be held for the duration of the VMA write
> > +lock, releasing or downgrading the mmap write lock also releases the VMA write
> > +lock so there is no :c:func:`!vma_end_write` function.
> > +
> > +Note that a semaphore write lock is not held across a VMA lock. Rather, a
> > +sequence number is used for serialisation, and the write semaphore is only
> > +acquired at the point of write lock to update this.
> > +
> > +This ensures the semantics we require - VMA write locks provide exclusive write
> > +access to the VMA.
> > +
> > +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> > +of the heavily contended mmap lock. It is implemented using a combination of a
> > +read/write semaphore and sequence numbers belonging to the containing
> > +:c:struct:`!struct mm_struct` and the VMA.
> > +
> > +Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
> > +operation, i.e. it tries to acquire a read lock but returns false if it is
> > +unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
> > +called to release the VMA read lock. This can be done under RCU alone.
>
> Please clarify what "This" refers to, and whether the part about RCU
> is explaining an implementation detail or the API contract.

Ack, have added a chonky comment on this.

>
> > +
> > +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> > +:c:func:`!vma_start_write`, however the write lock is released by the termination or
> > +downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.
> > +
> > +All this is achieved by the use of per-mm and per-VMA sequence counts, which are
> > +used in order to reduce complexity, especially for operations which write-lock
> > +multiple VMAs at once.
> > +
> > +If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
> > +sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
> > +they differ, then they are not.
>
> nit: "it is not"?

Ack, fixed.

>
> > +
> > +Each time an mmap write lock is acquired in :c:func:`!mmap_write_lock`,
> > +:c:func:`!mmap_write_lock_nested`, :c:func:`!mmap_write_lock_killable`, the
> > +:c:member:`!mm->mm_lock_seq` sequence number is incremented via
> > +:c:func:`!mm_lock_seqcount_begin`.
> > +
> > +Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
> > +:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
> > +also increments :c:member:`!mm->mm_lock_seq` via
> > +:c:func:`!mm_lock_seqcount_end`.
> > +
> > +This way, we ensure regardless of the VMA's sequence number count, that a write
> > +lock is not incorrectly indicated (since we increment the sequence counter on
> > +acquiring the mmap write lock, which is required in order to obtain a VMA write
> > +lock), and that when we release an mmap write lock, we efficiently release
> > +**all** VMA write locks contained within the mmap at the same time.
>
> Incrementing on mmap_write_lock() is not necessary for VMA locks; that
> part is for future seqlock-style users of the MM sequence count that
> want to work without even taking the VMA lock, with the new
> mmap_lock_speculation_{begin,end} API. See commit db8f33d3b7698 and
> the thread https://lore.kernel.org/linux-mm/20241010205644.3831427-5-andrii@xxxxxxxxxx/
> .

Right, I was aware of that part but thought we'd want to increment anyway;
however, I suppose that given you increment on lock _release_, it isn't
necessary.

We will be extending this section (or... Suren maybe will? ;)