On Fri, Nov 08, 2024 at 10:26:03AM +0200, Kirill A. Shutemov wrote: > On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote: > > +.. table:: Config-specific fields > > + > > + ================================= ===================== ======================================== =============== > > + Field Configuration option Description Write lock > > + ================================= ===================== ======================================== =============== > > + :c:member:`!anon_name` CONFIG_ANON_VMA_NAME A field for storing a mmap write, > > + :c:struct:`!struct anon_vma_name` VMA write. > > + object providing a name for anonymous > > + mappings, or :c:macro:`!NULL` if none > > + is set or the VMA is file-backed. > > + :c:member:`!swap_readahead_info` CONFIG_SWAP Metadata used by the swap mechanism mmap read. > > + to perform readahead. > > It is not clear how writes to the field is serialized by a shared lock. > Yes I think there is a swap-specific lock, but I'm not sure it's worth confusing matters by including that here, maybe here and for numab I will just wave my hands for that bit. > It worth noting that it is atomic. > Will add. > > + :c:member:`!vm_policy` CONFIG_NUMA :c:type:`!mempolicy` object which mmap write, > > + describes the NUMA behaviour of the VMA write. > > + VMA. > > + :c:member:`!numab_state` CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which mmap read. > > + describes the current state of > > + NUMA balancing in relation to this VMA. > > + Updated under mmap read lock by > > + :c:func:`!task_numa_work`. > > Again, shared lock serializing writes make zero sense. There's other > mechanism in play. > > I believe there's some kind of scheduler logic that excludes parallel > updates for the same process. But I cannot say I understand this. Ack, agreed, see above, hand waving probably required :) > > > + :c:member:`!vm_userfaultfd_ctx` CONFIG_USERFAULTFD Userfaultfd context wrapper object of mmap write, > > + type :c:type:`!vm_userfaultfd_ctx`, VMA write. > > + either of zero size if userfaultfd is > > + disabled, or containing a pointer > > + to an underlying > > + :c:type:`!userfaultfd_ctx` object which > > + describes userfaultfd metadata. > > + ================================= ===================== ======================================== =============== > > ... > > > +Lock ordering > > +------------- > > + > > +As we have multiple locks across the kernel which may or may not be taken at the > > +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and > > +the **order** in which locks are acquired and released becomes very important. > > + > > +.. note:: Lock inversion occurs when two threads need to acquire multiple locks, > > + but in doing so inadvertently cause a mutual deadlock. > > + > > + For example, consider thread 1 which holds lock A and tries to acquire lock B, > > + while thread 2 holds lock B and tries to acquire lock A. > > + > > + Both threads are now deadlocked on each other. However, had they attempted to > > + acquire locks in the same order, one would have waited for the other to > > + complete its work and no deadlock would have occurred. > > + > > +The opening comment in `mm/rmap.c` describes in detail the required ordering of > > +locks within memory management code: > > + > > +.. code-block:: > > + > > + inode->i_rwsem (while writing or truncating, not reading or faulting) > > + mm->mmap_lock > > + mapping->invalidate_lock (in filemap_fault) > > + folio_lock > > + hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below) > > + vma_start_write > > + mapping->i_mmap_rwsem > > + anon_vma->rwsem > > + mm->page_table_lock or pte_lock > > + swap_lock (in swap_duplicate, swap_info_get) > > + mmlist_lock (in mmput, drain_mmlist and others) > > + mapping->private_lock (in block_dirty_folio) > > + i_pages lock (widely used) > > + lruvec->lru_lock (in folio_lruvec_lock_irq) > > + inode->i_lock (in set_page_dirty's __mark_inode_dirty) > > + bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) > > + sb_lock (within inode_lock in fs/fs-writeback.c) > > + i_pages lock (widely used, in set_page_dirty, > > + in arch-dependent flush_dcache_mmap_lock, > > + within bdi.wb->list_lock in __sync_single_inode) > > + > > +Please check the current state of this comment which may have changed since the > > +time of writing of this document. > > I think we need one canonical place for this information. Maybe it worth > moving it here from rmap.c? There's more locking ordering info in filemap.c. Re: canonical place - yes I agree, once this doc goes I can follow up with a patch that replaces the comment with a link to the official kernel.org latest docs etc.? That would also allow us to render this more nicely perhaps, in future. Re: mm/filemap.c - these are getting into file system-specific stuff so not sure if the right place but there's a lot of overlap, maybe worth importing anyway. > > -- > Kiryl Shutsemau / Kirill A. Shutemov Thanks!