On Thu, Nov 07, 2024 at 10:15:31PM +0100, Jann Horn wrote:
> On Thu, Nov 7, 2024 at 8:02 PM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > Locking around VMAs is complicated and confusing. While we have a number of
> > disparate comments scattered around the place, we seem to be reaching a
> > level of complexity that justifies a serious effort at clearly documenting
> > how locks are expected to be used when it comes to interacting with
> > mm_struct and vm_area_struct objects.
>
> Thanks, I think this is looking pretty good now.

Thanks! I think we're iterating towards the final product bit by bit.

> > +VMA fields
> > +^^^^^^^^^^
> > +
> > +We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
> > +easier to explore their locking characteristics:
> > +
> > +.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
> > +   are in effect an internal implementation detail.
> > +
> > +.. table:: Virtual layout fields
> > +
> > +   ===================== ======================================== ===========
> > +   Field                 Description                              Write lock
> > +   ===================== ======================================== ===========
> [...]
> > +   :c:member:`!vm_pgoff` Describes the page offset into the file, rmap write.
> > +                         the original page offset within the      mmap write,
> > +                         virtual address space (prior to any      rmap write.
> > +                         :c:func:`!mremap`), or PFN if a PFN map
> > +                         and the architecture does not support
> > +                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
>
> Is that a typo in the rightmost column? "rmap write. mmap write, rmap
> write." lists rmap twice

Yep, I noticed this very shortly after sending the patch, have fixed for v2.
>
> > +   ===================== ======================================== ===========
> > +
> > +These fields describes the size, start and end of the VMA, and as such cannot be
>
> s/describes/describe/
>
> > +modified without first being hidden from the reverse mapping since these fields
> > +are used to locate VMAs within the reverse mapping interval trees.
> [...]
> > +.. table:: Reverse mapping fields
> > +
> > +   =================================== ========================================= ================
> > +   Field                               Description                               Write lock
> > +   =================================== ========================================= ================
> > +   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write,
> > +                                       mapping is file-backed, to place the VMA  VMA write,
> > +                                       in the                                    i_mmap write.
> > +                                       :c:member:`!struct address_space->i_mmap`
> > +                                       red/black interval tree.
>
> This list of write locks is correct regarding what locks you need to
> make changes to the VMA's tree membership. Technically at a lower
> level, the contents of vma->shared.rb are written while holding only
> the file rmap lock when the surrounding rbtree nodes change, but
> that's because the rbtree basically takes ownership of the node once
> it's been inserted into the tree. But I don't know how to concisely
> express that here, and it's kind of a nitpicky detail, so I think the
> current version looks good.

Yeah, I think we have to limit how far we go to keep this readable :)

> Maybe you could add "For changing the VMA's association with the
> rbtree:" on top of the list of write locks for this one?
>
> > +   :c:member:`!shared.rb_subtree_last` Metadata used for management of the
> > +                                       interval tree if the VMA is file-backed.  mmap write,
> > +                                                                                 VMA write,
> > +                                                                                 i_mmap write.
>
> For this one, I think it would be more accurate to say that it is
> written under just the i_mmap write lock. Though maybe that looks
> kinda inconsistent with the previous one...
But I think you probably have to hold the others to make the change, so I
think we are good to hand-wave a bit here.

There's an argument for not even listing these fields at all, since they are
an implementation detail (as I decided to do with the VMA lock fields), but we
keep them to be consistent with anon_vma, which we really do _have_ to talk
about.

>
> > +   :c:member:`!anon_vma_chain`         List of links to forked/CoW’d anon_vma    mmap read,
> > +                                       objects.                                  anon_vma write.
>
> Technically not just forked/CoW'd ones, but also the current one.
> Maybe something like "List of links to anon_vma objects (including
> inherited ones) that anonymous pages can be associated with"?

Yeah, true. I thought it was important to emphasise this as you have the one
you will use for exclusively mapped folios in ->anon_vma, but we should
mention this, so I will update to be more accurate.

I don't like 'related anon_vmas' or the vagueness of 'associated with', so I
think it's better to say 'forked/CoW'd anon_vma objects and vma->anon_vma if
non-NULL'.

The whole topic of anon_vmas is a fraught mess (hey, I did some nice diagrams
describing it in the book though :P) so I want to be as specific as I can
here.

>
> > +   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        mmap_read,
> > +                                       anonymous folios mapped exclusively to    page_table_lock.
> > +                                       this VMA.
>
> move_vma() uses unlink_anon_vmas() to change ->anon_vma from non-NULL
> to NULL. There we hold:
>
> - mmap lock (exclusive, from sys_mremap)
> - VMA write lock (from move_vma)
> - anon_vma lock (from unlink_anon_vmas)
>
> So it's not true that we always hold the page_table_lock for this.
>
> Should this maybe have two separate parts, one for "for changing NULL
> -> non-NULL" and one for "changing non-NULL to NULL"? Where the
> NULL->non-NULL scenario uses the locks you listed and non-NULL->NULL
> relies on write-locking the VMA and the anon_vma?
Yeah, there are some annoying inconsistencies. I sort of meant this as 'at
minimum', thinking the anon_vma lock might be implicit once set, but that's
silly; you're right, I will be explicit here and update as you say.

Have added some more details on anon_vma_prepare() etc. here too.

>
> > +   =================================== ========================================= ================
> > +
> > +These fields are used to both place the VMA within the reverse mapping, and for
> > +anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
> > +and the :c:struct:`!struct anon_vma` which folios mapped exclusively to this VMA should
>
> typo: s/which folios/in which folios/

Ack, fixed.

>
> > +reside.
> > +
> > +Page tables
> > +-----------
> > +
> > +We won't speak exhaustively on the subject but broadly speaking, page tables map
> > +virtual addresses to physical ones through a series of page tables, each of
> > +which contain entries with physical addresses for the next page table level
> > +(along with flags), and at the leaf level the physical addresses of the
> > +underlying physical data pages (with offsets into these pages provided by the
> > +virtual address itself).
> > +
> > +In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
> > +pages might eliminate one or two of these levels, but when this is the case we
> > +typically refer to the leaf level as the PTE level regardless.
>
> (That last sentence doesn't match my headcanon but I also don't have
> any reference for what is common Linux kernel phrasing around this so
> this isn't really an actionable comment.)

Does it match your head directory/entry? ;)

>
> > +.. note:: In instances where the architecture supports fewer page tables than
> > +   five the kernel cleverly 'folds' page table levels, that is skips them within
> > +   the logic, regardless we can act as if there were always five.
> > +
> > +There are three key operations typically performed on page tables:
> > +
> > +1. **Installing** page table mappings - whether creating a new mapping or
> > +   modifying an existing one.
> > +2. **Zapping/unmapping** page tables - This is what the kernel calls clearing page
>
> bikeshedding, feel free to ignore:
> Maybe "Zapping/unmapping page table entries"? At least that's how I
> always read "zap_pte_range()" in my head - "zap page table entry
> range". Though I don't know if that's the canonical interpretation.

Yeah, that's a good idea actually, will update.

>
> > +   table mappings at the leaf level only, whilst leaving all page tables in
> > +   place. This is a very common operation in the kernel performed on file
> > +   truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`,
> > +   and others. This is performed by a number of functions including
> > +   :c:func:`!unmap_mapping_range`, :c:func:`!unmap_mapping_pages` and reverse
> > +   mapping logic.
> [...]
> > +Locking rules
> > +^^^^^^^^^^^^^
> > +
> > +We establish basic locking rules when interacting with page tables:
> > +
> > +* When changing a page table entry the page table lock for that page table
> > +  **must** be held.
>
> (except, as you described below, in free_pgtables() when changing page
> table entries pointing to lower-level page tables)

Ack, will spell that out.

>
> > +* Reads from and writes to page table entries must be appropriately atomic. See
> > +  the section on atomicity below.
> [...]
> > +Page table installation
> > +^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > +PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
> > +acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
> > +:c:func:`!__pmd_alloc` respectively.
> > +
> > +.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
> > +   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
> > +   references the :c:member:`!mm->page_table_lock`.
> > +
> > +Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
> > +:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD
>
> typo: s/used/use/

Ack, but probably better as s/used// here I think :>)

Fixed.

>
> > +physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
> > +:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
> > +:c:func:`!__pte_alloc`.
> > +
> > +Finally, modifying the contents of the PTE has special treatment, as this is a
>
> nit: unclear what "this" refers to here - it looks like it refers to
> "the PTE", but "the PTE is a lock" wouldn't make grammatical sense

The PTE lock; have fixed.

>
> > +lock that we must acquire whenever we want stable and exclusive access to
> > +entries pointing to data pages within a PTE, especially when we wish to modify
> > +them.
>
> I don't think "entries pointing to data pages" need any more locking
> than other entries, like swap entries or migration markers?

Ack, updated.

>
> > +This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
> > +ensure that the PTE hasn't changed from under us, ultimately invoking
> > +:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
> > +the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
> > +must be released via :c:func:`!pte_unmap_unlock`.
> > +
> > +.. note:: There are some variants on this, such as
> > +   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
> > +   for brevity we do not explore this. See the comment for
> > +   :c:func:`!__pte_offset_map_lock` for more details.
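For what it's worth, the calling pattern being documented might be worth
showing inline in the doc, something like this kernel-context pseudocode
sketch (not compilable standalone; error/retry handling elided):

```c
spinlock_t *ptl;
pte_t *pte;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
	return; /* The PMD entry changed under us (e.g. concurrent collapse) - retry or bail. */

/* ... examine or modify PTE entries while holding the PTE-level lock ... */

pte_unmap_unlock(pte, ptl);
```

Just a thought - a concrete lock/use/unlock shape may land better than prose
alone.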
> > +
> > +When modifying data in ranges we typically only wish to allocate higher page
> > +tables as necessary, using these locks to avoid races or overwriting anything,
> > +and set/clear data at the PTE level as required (for instance when page faulting
> > +or zapping).
> [...]
> > +Page table moving
> > +^^^^^^^^^^^^^^^^^
> > +
> > +Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
> > +page tables). Most notable of these is :c:func:`!mremap`, which is capable of
> > +moving higher level page tables.
> > +
> > +In these instances, it is either required that **all** locks are taken, that is
> > +the mmap lock, the VMA lock and the relevant rmap lock, or that the mmap lock
> > +and VMA locks are taken and some other measure is taken to avoid rmap races (see
> > +the comment in :c:func:`!move_ptes` in the :c:func:`!mremap` implementation for
> > +details of how this is handled in this instance).
>
> mremap() always takes the rmap locks when moving entire page tables,
> and AFAIK that is necessary to avoid races that lead to TLB flushes
> going to the wrong address. mremap() sometimes moves *leaf entries*
> without holding rmap locks, but never entire tables.
>
> move_pgt_entry() is confusingly written - need_rmap_locks is actually
> always true in the NORMAL_* cases that move non-leaf entries.

OK, you're right, this NORMAL_ vs HPAGE_ thing... ugh, mremap() needs a TOTAL
REFACTOR. Another bit of churn for Lorenzo 'churn king' Stoakes to get to at
some point :P

Have updated.

I eventually want to put as much as I possibly can into mm/vma.c so we can
make as many things userland unit testable as possible. But that's in the
future :)

>
> > +You can observe that in the :c:func:`!mremap` implementation in the functions
> > +:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
> > +side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
> > +
> > +VMA lock internals
> > +------------------
> > +
> > +This kind of locking is entirely optimistic - if the lock is contended or a
> > +competing write has started, then we do not obtain a read lock.
> > +
> > +The :c:func:`!lock_vma_under_rcu` function first calls :c:func:`!rcu_read_lock`
> > +to ensure that the VMA is acquired in an RCU critical section, then attempts to
>
> Maybe s/is acquired in/is looked up in/, to make it clearer that
> you're talking about a VMA lookup?

Ack, this is why the maple tree is so critical here, as it's 'RCU-friendly'.
Have updated.

>
> > +VMA lock it via :c:func:`!vma_start_read`, before releasing the RCU lock via
> > +:c:func:`!rcu_read_unlock`.
> > +
> > +VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> > +their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
> > +via :c:func:`!vma_end_read`.
> > +
> > +VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
> > +VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
> > +acquired. An mmap write lock **must** be held for the duration of the VMA write
> > +lock, releasing or downgrading the mmap write lock also releases the VMA write
> > +lock so there is no :c:func:`!vma_end_write` function.
> > +
> > +Note that a semaphore write lock is not held across a VMA lock. Rather, a
> > +sequence number is used for serialisation, and the write semaphore is only
> > +acquired at the point of write lock to update this.
> > +
> > +This ensures the semantics we require - VMA write locks provide exclusive write
> > +access to the VMA.
> > +
> > +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> > +of the heavily contended mmap lock. It is implemented using a combination of a
> > +read/write semaphore and sequence numbers belonging to the containing
> > +:c:struct:`!struct mm_struct` and the VMA.
> > +
> > +Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
> > +operation, i.e. it tries to acquire a read lock but returns false if it is
> > +unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
> > +called to release the VMA read lock. This can be done under RCU alone.
>
> Please clarify what "This" refers to, and whether the part about RCU
> is explaining an implementation detail or the API contract.

Ack, have added a chonky comment on this.

>
> > +
> > +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> > +:c:func:`!vma_start_write`, however the write lock is released by the termination or
> > +downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.
> > +
> > +All this is achieved by the use of per-mm and per-VMA sequence counts, which are
> > +used in order to reduce complexity, especially for operations which write-lock
> > +multiple VMAs at once.
> > +
> > +If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
> > +sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
> > +they differ, then they are not.
>
> nit: "it is not"?

Ack, fixed.

>
> > +
> > +Each time an mmap write lock is acquired in :c:func:`!mmap_write_lock`,
> > +:c:func:`!mmap_write_lock_nested`, :c:func:`!mmap_write_lock_killable`, the
> > +:c:member:`!mm->mm_lock_seq` sequence number is incremented via
> > +:c:func:`!mm_lock_seqcount_begin`.
> > +
> > +Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
> > +:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
> > +also increments :c:member:`!mm->mm_lock_seq` via
> > +:c:func:`!mm_lock_seqcount_end`.
> > +
> > +This way, we ensure regardless of the VMA's sequence number count, that a write
> > +lock is not incorrectly indicated (since we increment the sequence counter on
> > +acquiring the mmap write lock, which is required in order to obtain a VMA write
> > +lock), and that when we release an mmap write lock, we efficiently release
> > +**all** VMA write locks contained within the mmap at the same time.
>
> Incrementing on mmap_write_lock() is not necessary for VMA locks; that
> part is for future seqlock-style users of the MM sequence count that
> want to work without even taking the VMA lock, with the new
> mmap_lock_speculation_{begin,end} API. See commit db8f33d3b7698 and
> the thread https://lore.kernel.org/linux-mm/20241010205644.3831427-5-andrii@xxxxxxxxxx/
> .

Right, I was aware of that part but thought we'd want to increment anyway;
however, I suppose that given you increment on lock _release_, it isn't
necessary.

We will be extending this section (or... Suren maybe will? ;)