> On Dec 21, 2020, at 2:30 PM, Peter Xu <peterx@xxxxxxxxxx> wrote: > > On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote: >> BTW: In general, I think that you are right, and that changing of PTEs >> should not require taking mmap_lock for write. However, I am not sure >> cow_user_page() is not the only one that poses a problem and whether a more >> systematic solution is needed. If cow_user_pages() is the only problem, do >> you think it is possible to do the copying while holding the PTL? It works >> for normal-pages, but I am not sure whether special-pages pose special >> problems. >> >> Anyhow, this is an enhancement that we can try later. > > AFAIU mprotect() is the only one who modifies the pte using the mmap write > lock. NUMA balancing is also using read mmap lock when changing pte > protections, while my understanding is mprotect() used write lock only because > it manipulates the address space itself (aka. vma layout) rather than modifying > the ptes, so it needs to. You are correct about NUMA balancing in general. Yet in practice it is not an issue in our “use-case” since NUMA balancing preserves the write-bit. > At the pte level, it seems always to be the pgtable lock that serializes things. > > So it's perfectly legal to me for e.g. a driver to modify ptes with the read > lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is > taken when doing so. > > If there's a driver that manipulated the ptes, changed the content of the page, > recover the ptes to origin, and all these happen right after wp_page_copy() > unlocked the pgtable lock but before wp_page_copy() retakes the same lock > again, we may face the same issue finding that the page got copied contains > corrupted data at last. While I don't know what to blame on the driver either > because it seems to be exactly following the rules. The driver would have to do so without flushing the TLB. Having said that, the driver could have used inc_tlb_flush_pending() and batch flushes. > > I believe changing into write lock would solve the race here because tlb > flushing would be guaranteed along the way, but I'm just a bit worried it's not > the best way to go.. It might be too big of a hammer. But the question that comes to my mind is, if it is ok to change the PTEs without mmap_lock held for write, why wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so mprotect change of PTE would not be done with the write-lock)? If you did so, you would have the same problem as the one we encountered (concurrent protect-unprotect allow concurrent cow-#PF copying the wrong data). So as an alternative solution, I can do copying under the PTL after flushing, which seems to solve the problem. First copying (without a lock) and then comparing seems to me as suboptimal - I do not think the benefit (if there is one) of shortening the time in which the lock is taken - worth the additional compare (and the complexity with special pages). There are 2 problems in doing so: 1. I think that copy_user_highpage() and __copy_from_user_inatomic() can be called while holding the PTL, but I am not sure. 2. For special pages we would need 2 TLB flushes: one to ensure the write-bit is cleared, and a second one after we clear the PTE. We can limit ourselves to soft-dirty/UFFD VMAs, but if we have your hypothetical driver, this would not be good enough.