On Fri, Nov 8, 2024 at 2:57 PM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote: > Locking around VMAs is complicated and confusing. While we have a number of > disparate comments scattered around the place, we seem to be reaching a > level of complexity that justifies a serious effort at clearly documenting > how locks are expected to be used when it comes to interacting with > mm_struct and vm_area_struct objects. > > This is especially pertinent as regards the efforts to find sensible > abstractions for these fundamental objects in kernel rust code whose > compiler strictly requires some means of expressing these rules (and > through this expression, self-document these requirements as well as > enforce them). > > The document limits scope to mmap and VMA locks and those that are > immediately adjacent and relevant to them - so additionally covers page > table locking as this is so very closely tied to VMA operations (and relies > upon us handling these correctly). > > The document tries to cover some of the nastier and more confusing edge > cases and concerns especially around lock ordering and page table teardown. > > The document is split between generally useful information for users of mm > interfaces, and separately a section intended for mm kernel developers > providing a discussion around internal implementation details. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> Reviewed-by: Jann Horn <jannh@xxxxxxxxxx> except some typos and one inaccuracy: > +* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock` > + which locks at a process address space granularity which can be acquired via > + :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants. > +* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves > + as a read/write semaphore in practice. A VMA read lock is obtained via > + :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a > + write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked > + automatically when the mmap write lock is released). To take a VMA write lock > + you **must** have already acquired an :c:func:`!mmap_write_lock`. > +* **rmap locks** - When trying to access VMAs through the reverse mapping via a > + :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object > + (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via missing dot between sentences? > +These fields describes the size, start and end of the VMA, and as such cannot be > +modified without first being hidden from the reverse mapping since these fields > +are used to locate VMAs within the reverse mapping interval trees. still a typo here, "these fields describes" > +.. note:: In instances where the architecture supports fewer page tables than > + five the kernel cleverly 'folds' page table levels, that is stubbing > + out functions related to the skipped levels. This allows us to > + conceptually act is if there were always five levels, even if the typo: s/is if/as if/ > +1. **Traversing** page tables - Simply reading page tables in order to traverse > + them. This only requires that the VMA is kept stable, so a lock which > + establishes this suffices for traversal (there are also lockless variants > + which eliminate even this requirement, such as :c:func:`!gup_fast`). > +2. **Installing** page table mappings - Whether creating a new mapping or > + modifying an existing one. This requires that the VMA is kept stable via an > + mmap or VMA lock (explicitly not rmap locks). Arguably clearing A/D bits through the rmap, and switching PTEs between present entries and migration entries, counts as "modifying an existing one", but the locking for that is like for zapping/unmapping (both page_idle_clear_pte_refs and try_to_migrate go through the rmap). So "modifying an existing one" either needs more caveats or more likely should just be moved to point three. > +3. **Zapping/unmapping** page table entries - This is what the kernel calls > + clearing page table mappings at the leaf level only, whilst leaving all page > + tables in place. This is a very common operation in the kernel performed on > + file truncation, the :c:macro:`!MADV_DONTNEED` operation via > + :c:func:`!madvise`, and others. This is performed by a number of functions > + including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages` > + among others. The VMA need only be kept stable for this operation. > +4. **Freeing** page tables - When finally the kernel removes page tables from a > + userland process (typically via :c:func:`!free_pgtables`) extreme care must > + be taken to ensure this is done safely, as this logic finally frees all page > + tables in the specified range, ignoring existing leaf entries (it assumes the > + caller has both zapped the range and prevented any further faults or > + modifications within it). > + > +**Traversing** and **zapping** ranges can be performed holding any one of the > +locks described in the terminology section above - that is the mmap lock, the > +VMA lock or either of the reverse mapping locks. > + > +That is - as long as you keep the relevant VMA **stable** - you are good to go > +ahead and perform these operations on page tables (though internally, kernel > +operations that perform writes also acquire internal page table locks to > +serialise - see the page table implementation detail section for more details). > + > +When **installing** page table entries, the mmap or VMA lock mut be held to keep typo: "must be held" > +When performing a page table traversal and keeping the VMA stable, whether a > +read must be performed once and only once or not depends on the architecture > +(for instance x86-64 does not require any special precautions). Nitpick: In theory that'd still be a data race with other threads concurrently installing page tables, though in practice compilers don't seem to do anything bad with stuff like that. > +A typical pattern taken when traversing page table entries to install a new > +mapping is to optimistically determine whether the page table entry in the table > +above is empty, if so, only then acquiring the page table lock and checking > +again to see if it was allocated underneath is. s/ is// ? > +This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry > +for the PTE, carefully checking it is as expected, before acquiring the > +PTE-specific lock, and then *again* checking that the PMD lock is as expected. s/PMD lock is/PMD entry is/ ? > +In these instances, it is required that **all** locks are taken, that is > +the mmap lock, the VMA lock and the relevant rmap lock. s/rmap lock/rmap locks/ > +VMA read locking is entirely optimistic - if the lock is contended or a competing > +write has started, then we do not obtain a read lock. > + > +A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu` function, which "is obtained by lock_vma_under_rcu function" sounds weird, maybe either "is obtained by lock_vma_under_rcu" or "is obtained by the lock_vma_under_rcu function"?