On Tue, Nov 12, 2024 at 10:15:44AM -0500, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> [241108 08:57]:

[snip]

> > +Each mm object contains a maple tree data structure which describes all VMAs
> > +within the virtual address space.
> > +
> > +.. note:: An exception to this is the 'gate' VMA which is provided by
> > +   architectures which use :c:struct:`!vsyscall` and is a global static
> > +   object which does not belong to any specific mm.
>
> vvars too?

I'm not sure that's the case? For instance, for x86-64 we have:

	/*
	 * A pseudo VMA to allow ptrace access for the vsyscall page.  This only
	 * covers the 64bit vsyscall page now. 32bit has a real VMA now and does
	 * not need special handling anymore:
	 */
	static const char *gate_vma_name(struct vm_area_struct *vma)
	{
		return "[vsyscall]";
	}

With no reference to vvar. Also, vsyscall exists in a kernel range, whereas
vvar lives in a userland mapping (along with the vdso).

I am not hugely expert in this area, however, and may be missing something.

[snip]

> > +
> > +VMA locks are special in that you must obtain an mmap **write** lock **first**
> > +in order to obtain a VMA **write** lock. A VMA **read** lock however can be
> > +obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
> > +release an RCU lock to lookup the VMA for you).
>
> This reduces the impact of a writer on readers by only impacting
> conflicting areas of the vma tree.
>
> > +
> > +.. note:: The primary users of VMA read locks are page fault handlers, which
> > +   means that without a VMA write lock, page faults will run concurrent with
> > +   whatever you are doing.
>
> This is the primary user in that it's the most frequent, but as we
> unwind other lock messes it is becoming a pattern.
>
> Maybe "the most frequent users" ?

I mean, the only other users (at least in 6.11.y) are uffd and ipv4 tcp, so I
think 'primary' is fine. This doesn't preclude other users, but I think 'the
most frequent users' vs. 'primary users' is likely to be equally outdated
should we move sufficient other stuff to using VMA locks (and then docs can be
updated too :)

[snip]

> > +.. table:: Config-specific fields
> > +
> > +   ================================= ===================== ======================================== ===============
> > +   Field                             Configuration option  Description                              Write lock
> > +   ================================= ===================== ======================================== ===============
> > +   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
> > +                                                           :c:struct:`!struct anon_vma_name`        VMA write.
> > +                                                           object providing a name for anonymous
> > +                                                           mappings, or :c:macro:`!NULL` if none
> > +                                                           is set or the VMA is file-backed.
>
> These are ref counted and can be shared by more than one vma for
> scalability.

I'm talking about the field, so I think this can be treated as an
implementation detail, perhaps? In this opening section I tried to avoid too
much detail, as v1 of this series sort of went all-in and people (rightly!)
raised the issue that you couldn't pull usable information out of it right
away, and that it was too weighted towards internal mm kernel devs.

I can add a little thing about it here if you want, but I feel it's probably
not necessary?

> > +   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
> > +                                                           to perform readahead. This field is      swap-specific
> > +                                                           accessed atomically.                     lock.
> > +   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
> > +                                                           describes the NUMA behaviour of the      VMA write.
> > +                                                           VMA.
>
> These are also ref counted for scalability.

See above.

> > +   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
> > +                                                           describes the current state of           numab-specific
> > +                                                           NUMA balancing in relation to this VMA.  lock.
> > +                                                           Updated under mmap read lock by
> > +                                                           :c:func:`!task_numa_work`.
> > +   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
> > +                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
> > +                                                           either of zero size if userfaultfd is
> > +                                                           disabled, or containing a pointer
> > +                                                           to an underlying
> > +                                                           :c:type:`!userfaultfd_ctx` object which
> > +                                                           describes userfaultfd metadata.
> > +   ================================= ===================== ======================================== ===============

[snip]

> > +There is also a file-system specific lock ordering comment located at the top of
> > +:c:macro:`!mm/filemap.c`:
> > +
> > +.. code-block::
> > +
> > +  ->i_mmap_rwsem                    (truncate_pagecache)
> > +    ->private_lock                  (__free_pte->block_dirty_folio)
> > +      ->swap_lock                   (exclusive_swap_page, others)
> > +        ->i_pages lock
> > +
> > +  ->i_rwsem
> > +    ->invalidate_lock               (acquired by fs in truncate path)
> > +      ->i_mmap_rwsem                (truncate->unmap_mapping_range)
> > +
> > +  ->mmap_lock
> > +    ->i_mmap_rwsem
> > +      ->page_table_lock or pte_lock (various, mainly in memory.c)
> > +        ->i_pages lock              (arch-dependent flush_dcache_mmap_lock)
> > +
> > +  ->mmap_lock
> > +    ->invalidate_lock               (filemap_fault)
> > +      ->lock_page                   (filemap_fault, access_process_vm)
> > +
> > +  ->i_rwsem                         (generic_perform_write)
> > +    ->mmap_lock                     (fault_in_readable->do_page_fault)
> > +
> > +  bdi->wb.list_lock
> > +    sb_lock                         (fs/fs-writeback.c)
> > +    ->i_pages lock                  (__sync_single_inode)
> > +
> > +  ->i_mmap_rwsem
> > +    ->anon_vma.lock                 (vma_merge)
> > +
> > +  ->anon_vma.lock
> > +    ->page_table_lock or pte_lock   (anon_vma_prepare and various)
> > +
> > +  ->page_table_lock or pte_lock
> > +    ->swap_lock                     (try_to_unmap_one)
> > +    ->private_lock                  (try_to_unmap_one)
> > +    ->i_pages lock                  (try_to_unmap_one)
> > +    ->lruvec->lru_lock              (follow_page_mask->mark_page_accessed)
> > +    ->lruvec->lru_lock              (check_pte_range->folio_isolate_lru)
> > +    ->private_lock                  (folio_remove_rmap_pte->set_page_dirty)
> > +    ->i_pages lock                  (folio_remove_rmap_pte->set_page_dirty)
> > +    bdi.wb->list_lock               (folio_remove_rmap_pte->set_page_dirty)
> > +    ->inode->i_lock                 (folio_remove_rmap_pte->set_page_dirty)
> > +    bdi.wb->list_lock               (zap_pte_range->set_page_dirty)
> > +    ->inode->i_lock                 (zap_pte_range->set_page_dirty)
> > +    ->private_lock                  (zap_pte_range->block_dirty_folio)
> > +
> > +Please check the current state of these comments which may have changed since
> > +the time of writing of this document.
>
> hugetlbfs has its own locking and is out of scope.

Yeah, I've generally avoided this here as kinda implicitly out of scope; I mean,
hugetlbfs has its own VMA locking too.

I think somebody reading this document interested in hugetlbfs internals would
kinda get the point that the lack of reference to it implies we don't cover it.

But I can add a note somewhere in the doc to say 'except hugetlbfs'? :>)

> > +
> > +------------------------------
> > +Locking Implementation Details
> > +------------------------------
> > +
> > +Page table locking details
> > +--------------------------
> > +
> > +In addition to the locks described in the terminology section above, we have
> > +additional locks dedicated to page tables:
> > +
> > +* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
> > +  and PUD each make use of the process address space granularity
> > +  :c:member:`!mm->page_table_lock` lock when modified.
> > +
> > +* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
> > +  either kept within the folios describing the page tables or allocated
> > +  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
> > +  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
> > +  mapped into higher memory (if a 32-bit system) and carefully locked via
> > +  :c:func:`!pte_offset_map_lock`.
> > +
> > +These locks represent the minimum required to interact with each page table
> > +level, but there are further requirements.
> > +
> > +Importantly, note that on a **traversal** of page tables, no such locks are
> > +taken. Whether care is taken on reading the page table entries depends on the
> > +architecture, see the section on atomicity below.
> > +
> > +Locking rules
> > +^^^^^^^^^^^^^
> > +
> > +We establish basic locking rules when interacting with page tables:
> > +
> > +* When changing a page table entry the page table lock for that page table
> > +  **must** be held, except if you can safely assume nobody can access the page
> > +  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
> > +* Reads from and writes to page table entries must be *appropriately*
> > +  atomic. See the section on atomicity below for details.
> > +* Populating previously empty entries requires that the mmap or VMA locks are
> > +  held (read or write), doing so with only rmap locks would be dangerous (see
> > +  the warning below).
>
> Which is the rmap lock? It's not listed as rmap lock in the rmap file.

I explicitly describe what the rmap locks are in the terminology section. It's
just shorthand so I don't have to say 'i_mmap_rwsem read, write, mmap_lock
read, write locks' over + over again :>)

I know this was a bugbear of yours with recent stuff relating to these locks,
and I totally get that it's annoying to have this sort of non-standard way of
referring to them, but I somewhat gave in.

I sort of have a (crazy) desire to do something about this separation at some
point and make it one lock by eliminating the anon_vma stuff... but that's
perhaps an RFC for the future...

[snip]

> > +If a THP collapse (or similar) were to occur then the lock on both pages would
> > +be acquired, so we can ensure this is prevented while the PTE lock is held.
> > +
> > +Installing entries this way ensures mutual exclusion on write.
> > +
>
> I stopped here, but missed the v1 comment time so I'm sending this now.
> ...

Thanks, appreciate your feedback!