Re: [PATCH v2] docs/mm: add VMA locks documentation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Nov 13, 2024 at 08:46:39PM +0100, Jann Horn wrote:
> On Fri, Nov 8, 2024 at 2:57 PM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > Locking around VMAs is complicated and confusing. While we have a number of
> > disparate comments scattered around the place, we seem to be reaching a
> > level of complexity that justifies a serious effort at clearly documenting
> > how locks are expected to be used when it comes to interacting with
> > mm_struct and vm_area_struct objects.
> >
> > This is especially pertinent as regards the efforts to find sensible
> > abstractions for these fundamental objects in kernel rust code whose
> > compiler strictly requires some means of expressing these rules (and
> > through this expression, self-document these requirements as well as
> > enforce them).
> >
> > The document limits scope to mmap and VMA locks and those that are
> > immediately adjacent and relevant to them - so additionally covers page
> > table locking as this is so very closely tied to VMA operations (and relies
> > upon us handling these correctly).
> >
> > The document tries to cover some of the nastier and more confusing edge
> > cases and concerns especially around lock ordering and page table teardown.
> >
> > The document is split between generally useful information for users of mm
> > interfaces, and separately a section intended for mm kernel developers
> > providing a discussion around internal implementation details.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
>
> Reviewed-by: Jann Horn <jannh@xxxxxxxxxx>

Thanks! Will do a quick respin to fixup what you raised (+ pull in the fix-patch
there too obv.)

>
> except some typos and one inaccuracy:
>
> > +* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
> > +  which locks at a process address space granularity which can be acquired via
> > +  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
> > +* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
> > +  as a read/write semaphore in practice. A VMA read lock is obtained via
> > +  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
> > +  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
> > +  automatically when the mmap write lock is released). To take a VMA write lock
> > +  you **must** have already acquired an :c:func:`!mmap_write_lock`.
> > +* **rmap locks** - When trying to access VMAs through the reverse mapping via a
> > +  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
> > +  (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via
>
> missing dot between sentences?

Ack

>
> > +These fields describes the size, start and end of the VMA, and as such cannot be
> > +modified without first being hidden from the reverse mapping since these fields
> > +are used to locate VMAs within the reverse mapping interval trees.
>
> still a typo here, "these fields describes"

This is how we speak in Devon ;) will fix (the typo that is, not my
Devonshire dialect).

>
> > +.. note:: In instances where the architecture supports fewer page tables than
> > +         five the kernel cleverly 'folds' page table levels, that is stubbing
> > +         out functions related to the skipped levels. This allows us to
> > +         conceptually act is if there were always five levels, even if the
>
> typo: s/is if/as if/

Ack fixed.

>
> > +1. **Traversing** page tables - Simply reading page tables in order to traverse
> > +   them. This only requires that the VMA is kept stable, so a lock which
> > +   establishes this suffices for traversal (there are also lockless variants
> > +   which eliminate even this requirement, such as :c:func:`!gup_fast`).
> > +2. **Installing** page table mappings - Whether creating a new mapping or
> > +   modifying an existing one. This requires that the VMA is kept stable via an
> > +   mmap or VMA lock (explicitly not rmap locks).
>
> Arguably clearing A/D bits through the rmap, and switching PTEs
> between present entries and migration entries, counts as "modifying an
> existing one", but the locking for that is like for zapping/unmapping
> (both page_idle_clear_pte_refs and try_to_migrate go through the
> rmap). So "modifying an existing one" either needs more caveats or
> more likely should just be moved to point three.

Yeah I think this is a terminology thing, maybe better to say modifying which
changes _identity_.

I don't think it's right to classify these as zapping. Better to just put a
separate note underneath about this.

Hopefully that should suffice to keep things comprehensible here :)

>
> > +3. **Zapping/unmapping** page table entries - This is what the kernel calls
> > +   clearing page table mappings at the leaf level only, whilst leaving all page
> > +   tables in place. This is a very common operation in the kernel performed on
> > +   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
> > +   :c:func:`!madvise`, and others. This is performed by a number of functions
> > +   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`
> > +   among others. The VMA need only be kept stable for this operation.
> > +4. **Freeing** page tables - When finally the kernel removes page tables from a
> > +   userland process (typically via :c:func:`!free_pgtables`) extreme care must
> > +   be taken to ensure this is done safely, as this logic finally frees all page
> > +   tables in the specified range, ignoring existing leaf entries (it assumes the
> > +   caller has both zapped the range and prevented any further faults or
> > +   modifications within it).
> > +
> > +**Traversing** and **zapping** ranges can be performed holding any one of the
> > +locks described in the terminology section above - that is the mmap lock, the
> > +VMA lock or either of the reverse mapping locks.
> > +
> > +That is - as long as you keep the relevant VMA **stable** - you are good to go
> > +ahead and perform these operations on page tables (though internally, kernel
> > +operations that perform writes also acquire internal page table locks to
> > +serialise - see the page table implementation detail section for more details).
> > +
> > +When **installing** page table entries, the mmap or VMA lock mut be held to keep
>
> typo: "must be held"

Ack.

>
> > +When performing a page table traversal and keeping the VMA stable, whether a
> > +read must be performed once and only once or not depends on the architecture
> > +(for instance x86-64 does not require any special precautions).
>
> Nitpick: In theory that'd still be a data race with other threads
> concurrently installing page tables, though in practice compilers
> don't seem to do anything bad with stuff like that.

There's nothing to prevent anything at all... presumably there's some
characteristic of x86-64 that makes all this _somehow_ ok. Maybe somethinmg
to expand on in future.

>
> > +A typical pattern taken when traversing page table entries to install a new
> > +mapping is to optimistically determine whether the page table entry in the table
> > +above is empty, if so, only then acquiring the page table lock and checking
> > +again to see if it was allocated underneath is.
>
> s/ is// ?

s/is/us/ :>) fixed.

>
> > +This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
> > +for the PTE, carefully checking it is as expected, before acquiring the
> > +PTE-specific lock, and then *again* checking that the PMD lock is as expected.
>
> s/PMD lock is/PMD entry is/ ?

ack, fixed.

>
> > +In these instances, it is required that **all** locks are taken, that is
> > +the mmap lock, the VMA lock and the relevant rmap lock.
>
> s/rmap lock/rmap locks/

Hm... you mean ref: the MAP_PRIVATE weirdness meaning we might need
two. Sigh.

I _hate_ anon_vma's. I may do something about them.

But also ack, fixing.

>
> > +VMA read locking is entirely optimistic - if the lock is contended or a competing
> > +write has started, then we do not obtain a read lock.
> > +
> > +A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu` function, which
>
> "is obtained by lock_vma_under_rcu function" sounds weird, maybe
> either "is obtained by lock_vma_under_rcu" or "is obtained by the
> lock_vma_under_rcu function"?

Ack agreed, will fix.




[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux