> On Dec 21, 2020, at 7:19 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote: > > On Mon, Dec 21, 2020 at 3:22 PM Linus Torvalds > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: >> On Mon, Dec 21, 2020 at 2:30 PM Peter Xu <peterx@xxxxxxxxxx> wrote: >>> AFAIU mprotect() is the only one who modifies the pte using the mmap write >>> lock. NUMA balancing is also using read mmap lock when changing pte >>> protections, while my understanding is mprotect() used write lock only because >>> it manipulates the address space itself (aka. vma layout) rather than modifying >>> the ptes, so it needs to. >> >> So it's ok to change the pte holding only the PTE lock, if it's a >> *one*way* conversion. >> >> That doesn't break the "re-check the PTE contents" model (which >> predates _all_ of the rest: NUMA, userfaultfd, everything - it's >> pretty much the original model for our page table operations, and goes >> back to the dark ages even before SMP and the existence of a page >> table lock). >> >> So for example, a COW will always create a different pte (not just >> because the page number itself changes - you could imagine a page >> getting re-used and changing back - but because it's always a RO->RW >> transition). >> >> So two COW operations cannot "undo" each other and fool us into >> thinking nothing changed. >> >> Anything that changes RW->RO - like fork(), for example - needs to >> take the mmap_lock. > > Ugh, this is unpleasantly complicated. I will admit that any API that > takes an address and more-or-less-blindly marks it RO makes me quite > nervous even assuming all the relevant locks are held. At least > userfaultfd refuses to operate on VM_SHARED VMAs, but we have another > instance of this (with mmap_sem held for write) in x86: > mark_screen_rdonly(). Dare I ask how broken this is? We could likely > get away with deleting it entirely. If you only look at the function in isolation, it seems broken. It should have flushed the TLB before releasing the mmap_lock. After the mmap_write_unlock() and before the actual flush, a #PF on another thread can happen, and a similar scenario to the one that is mentioned in this thread (copying while a stale PTE in the TLBs is not-writeprotected) might happen. Having said that, I do not know this code and the context in which this function is called, so I do not know whether there are other mitigating factors. Funny, I had a deja-vu and indeed you have already raised (other) TLB issues with mark_screen_rdonly() 3 years ago. At the time you said "I'd like to delete it.” [1] [1] https://lore.kernel.org/patchwork/patch/782486/#976151