On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote:
>
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be interacted with when it comes to interacting
> with mm_struct and vm_area_struct objects.
>
> This is especially pertinent as regards efforts to find sensible
> abstractions for these fundamental objects within the kernel rust
> abstraction whose compiler strictly requires some means of expressing these
> rules (and through this expression can help self-document these
> requirements as well as enforce them which is an exciting concept).
>
> The document limits scope to mmap and VMA locks and those that are
> immediately adjacent and relevant to them - so additionally covers page
> table locking as this is so very closely tied to VMA operations (and relies
> upon us handling these correctly).
>
> The document tries to cover some of the nastier and more confusing edge
> cases and concerns especially around lock ordering and page table teardown.
>
> The document also provides some VMA lock internals, which are up to date
> and inclusive of recent changes to recent sequence number changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>

[...]

> +Page table locks
> +----------------
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> +
> +.. note::
> +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> +   however at the time of writing it ultimately references the
> +   `mm->page_table_lock`.
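
To check my understanding of the part quoted above: the upper-level
allocation seems to follow an "allocate first, then take the lock and
recheck" pattern. Below is a rough userspace sketch of that pattern, not
kernel code. `struct fake_mm`, `struct table` and `table_alloc()` are
hypothetical names of my own, and a pthread mutex stands in for
`mm->page_table_lock`. It assumes, from my reading of `__pud_alloc()` in
mm/memory.c, that the new table is allocated before the lock is taken and
freed again if another thread populated the entry first.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define ENTRIES 4

/* Hypothetical stand-ins for illustration only, not kernel types. */
struct table {
	struct table *entry[ENTRIES];
};

struct fake_mm {
	pthread_mutex_t page_table_lock; /* models mm->page_table_lock */
	struct table top;                /* models the upper-level table */
};

/*
 * Make sure top.entry[idx] points at a lower-level table.
 * Returns 0 on success, -1 if the allocation failed.
 */
static int table_alloc(struct fake_mm *mm, unsigned int idx)
{
	struct table *new;

	assert(idx < ENTRIES);

	/* Allocate before locking so the lock is not held across calloc(). */
	new = calloc(1, sizeof(*new));
	if (!new)
		return -1;

	pthread_mutex_lock(&mm->page_table_lock);
	if (!mm->top.entry[idx]) {
		/* Nobody raced with us: install the new table. */
		mm->top.entry[idx] = new;
		new = NULL;
	}
	pthread_mutex_unlock(&mm->page_table_lock);

	/* Entry was already populated by someone else: drop our copy. */
	free(new);
	return 0;
}
```

In this model, calling `table_alloc()` twice for the same index leaves the
first table in place and frees the second one.
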
> +
> +Allocating a PTE will either use the `mm->page_table_lock` or, if
> +`USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD physical
> +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> +
> +Finally, modifying the contents of the PTE has special treatment, as this is a
> +lock that we must acquire whenever we want stable and exclusive access to
> +entries pointing to data pages within a PTE, especially when we wish to modify
> +them.
> +
> +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> +associated with the physical PTE page. The lock must be released via
> +`pte_unmap_unlock()`.
> +
> +.. note::
> +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> +   know we hold the PTE stable but for brevity we do not explore this.
> +   See the comment for `__pte_offset_map_lock()` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).

Speaking as someone who doesn't know the internals at all ... this
section doesn't really answer any questions I have about the page
table. It looks like this could use an initial section about basic
usage, and the detailed information could come after?

Concretely, if I wish to call vm_insert_page or zap some pages, what
are the locking requirements? What if I'm writing a page fault
handler?

Alice