On Mon, Dec 21, 2020 at 3:22 PM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Mon, Dec 21, 2020 at 2:30 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > AFAIU mprotect() is the only one who modifies the pte using the mmap write > > lock. NUMA balancing is also using read mmap lock when changing pte > > protections, while my understanding is mprotect() used write lock only because > > it manipulates the address space itself (aka. vma layout) rather than modifying > > the ptes, so it needs to. > > So it's ok to change the pte holding only the PTE lock, if it's a > *one*way* conversion. > > That doesn't break the "re-check the PTE contents" model (which > predates _all_ of the rest: NUMA, userfaultfd, everything - it's > pretty much the original model for our page table operations, and goes > back to the dark ages even before SMP and the existence of a page > table lock). > > So for example, a COW will always create a different pte (not just > because the page number itself changes - you could imagine a page > getting re-used and changing back - but because it's always a RO->RW > transition). > > So two COW operations cannot "undo" each other and fool us into > thinking nothing changed. > > Anything that changes RW->RO - like fork(), for example - needs to > take the mmap_lock. Ugh, this is unpleasantly complicated. I will admit that any API that takes an address and more-or-less-blindly marks it RO makes me quite nervous even assuming all the relevant locks are held. At least userfaultfd refuses to operate on VM_SHARED VMAs, but we have another instance of this (with mmap_sem held for write) in x86: mark_screen_rdonly(). Dare I ask how broken this is? We could likely get away with deleting it entirely. --Andy