On Wed, Nov 13, 2024 at 08:46:39PM +0100, Jann Horn wrote: > On Fri, Nov 8, 2024 at 2:57 PM Lorenzo Stoakes > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > Locking around VMAs is complicated and confusing. While we have a number of > > disparate comments scattered around the place, we seem to be reaching a > > level of complexity that justifies a serious effort at clearly documenting > > how locks are expected to be used when it comes to interacting with > > mm_struct and vm_area_struct objects. > > > > This is especially pertinent as regards the efforts to find sensible > > abstractions for these fundamental objects in kernel rust code whose > > compiler strictly requires some means of expressing these rules (and > > through this expression, self-document these requirements as well as > > enforce them). > > > > The document limits scope to mmap and VMA locks and those that are > > immediately adjacent and relevant to them - so additionally covers page > > table locking as this is so very closely tied to VMA operations (and relies > > upon us handling these correctly). > > > > The document tries to cover some of the nastier and more confusing edge > > cases and concerns especially around lock ordering and page table teardown. > > > > The document is split between generally useful information for users of mm > > interfaces, and separately a section intended for mm kernel developers > > providing a discussion around internal implementation details. > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > Reviewed-by: Jann Horn <jannh@xxxxxxxxxx> Thanks! Will do a quick respin to fixup what you raised (+ pull in the fix-patch there too obv.) > > except some typos and one inaccuracy: > > > +* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock` > > + which locks at a process address space granularity which can be acquired via > > + :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants. > > +* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves > > + as a read/write semaphore in practice. A VMA read lock is obtained via > > + :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a > > + write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked > > + automatically when the mmap write lock is released). To take a VMA write lock > > + you **must** have already acquired an :c:func:`!mmap_write_lock`. > > +* **rmap locks** - When trying to access VMAs through the reverse mapping via a > > + :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object > > + (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via > > missing dot between sentences? Ack > > > +These fields describes the size, start and end of the VMA, and as such cannot be > > +modified without first being hidden from the reverse mapping since these fields > > +are used to locate VMAs within the reverse mapping interval trees. > > still a typo here, "these fields describes" This is how we speak in Devon ;) will fix (the typo that is, not my Devonshire dialect). > > > +.. note:: In instances where the architecture supports fewer page tables than > > + five the kernel cleverly 'folds' page table levels, that is stubbing > > + out functions related to the skipped levels. This allows us to > > + conceptually act is if there were always five levels, even if the > > typo: s/is if/as if/ Ack fixed. > > > +1. **Traversing** page tables - Simply reading page tables in order to traverse > > + them. This only requires that the VMA is kept stable, so a lock which > > + establishes this suffices for traversal (there are also lockless variants > > + which eliminate even this requirement, such as :c:func:`!gup_fast`). > > +2. **Installing** page table mappings - Whether creating a new mapping or > > + modifying an existing one. This requires that the VMA is kept stable via an > > + mmap or VMA lock (explicitly not rmap locks). > > Arguably clearing A/D bits through the rmap, and switching PTEs > between present entries and migration entries, counts as "modifying an > existing one", but the locking for that is like for zapping/unmapping > (both page_idle_clear_pte_refs and try_to_migrate go through the > rmap). So "modifying an existing one" either needs more caveats or > more likely should just be moved to point three. Yeah I think this is a terminology thing, maybe better to say modifying which changes _identity_. I don't think it's right to classify these as zapping. Better to just put a separate note underneath about this. Hopefully that should suffice to keep things comprehensible here :) > > > +3. **Zapping/unmapping** page table entries - This is what the kernel calls > > + clearing page table mappings at the leaf level only, whilst leaving all page > > + tables in place. This is a very common operation in the kernel performed on > > + file truncation, the :c:macro:`!MADV_DONTNEED` operation via > > + :c:func:`!madvise`, and others. This is performed by a number of functions > > + including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages` > > + among others. The VMA need only be kept stable for this operation. > > +4. **Freeing** page tables - When finally the kernel removes page tables from a > > + userland process (typically via :c:func:`!free_pgtables`) extreme care must > > + be taken to ensure this is done safely, as this logic finally frees all page > > + tables in the specified range, ignoring existing leaf entries (it assumes the > > + caller has both zapped the range and prevented any further faults or > > + modifications within it). > > + > > +**Traversing** and **zapping** ranges can be performed holding any one of the > > +locks described in the terminology section above - that is the mmap lock, the > > +VMA lock or either of the reverse mapping locks. > > + > > +That is - as long as you keep the relevant VMA **stable** - you are good to go > > +ahead and perform these operations on page tables (though internally, kernel > > +operations that perform writes also acquire internal page table locks to > > +serialise - see the page table implementation detail section for more details). > > + > > +When **installing** page table entries, the mmap or VMA lock mut be held to keep > > typo: "must be held" Ack. > > > +When performing a page table traversal and keeping the VMA stable, whether a > > +read must be performed once and only once or not depends on the architecture > > +(for instance x86-64 does not require any special precautions). > > Nitpick: In theory that'd still be a data race with other threads > concurrently installing page tables, though in practice compilers > don't seem to do anything bad with stuff like that. There's nothing to prevent anything at all... presumably there's some characteristic of x86-64 that makes all this _somehow_ ok. Maybe somethinmg to expand on in future. > > > +A typical pattern taken when traversing page table entries to install a new > > +mapping is to optimistically determine whether the page table entry in the table > > +above is empty, if so, only then acquiring the page table lock and checking > > +again to see if it was allocated underneath is. > > s/ is// ? s/is/us/ :>) fixed. > > > +This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry > > +for the PTE, carefully checking it is as expected, before acquiring the > > +PTE-specific lock, and then *again* checking that the PMD lock is as expected. > > s/PMD lock is/PMD entry is/ ? ack, fixed. > > > +In these instances, it is required that **all** locks are taken, that is > > +the mmap lock, the VMA lock and the relevant rmap lock. > > s/rmap lock/rmap locks/ Hm... you mean ref: the MAP_PRIVATE weirdness meaning we might need two. Sigh. I _hate_ anon_vma's. I may do something about them. But also ack, fixing. > > > +VMA read locking is entirely optimistic - if the lock is contended or a competing > > +write has started, then we do not obtain a read lock. > > + > > +A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu` function, which > > "is obtained by lock_vma_under_rcu function" sounds weird, maybe > either "is obtained by lock_vma_under_rcu" or "is obtained by the > lock_vma_under_rcu function"? Ack agreed, will fix.