On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote: > BTW: In general, I think that you are right, and that changing of PTEs > should not require taking mmap_lock for write. However, I am not sure > cow_user_page() is not the only one that poses a problem and whether a more > systematic solution is needed. If cow_user_pages() is the only problem, do > you think it is possible to do the copying while holding the PTL? It works > for normal-pages, but I am not sure whether special-pages pose special > problems. > > Anyhow, this is an enhancement that we can try later. AFAIU mprotect() is the only one who modifies the pte using the mmap write lock. NUMA balancing is also using read mmap lock when changing pte protections, while my understanding is mprotect() used write lock only because it manipulates the address space itself (aka. vma layout) rather than modifying the ptes, so it needs to. At the pte level, it seems always to be the pgtable lock that serializes things. So it's perfectly legal to me for e.g. a driver to modify ptes with the read lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is taken when doing so. If there's a driver that manipulated the ptes, changed the content of the page, recover the ptes to origin, and all these happen right after wp_page_copy() unlocked the pgtable lock but before wp_page_copy() retakes the same lock again, we may face the same issue finding that the page got copied contains corrupted data at last. While I don't know what to blame on the driver either because it seems to be exactly following the rules. I believe changing into write lock would solve the race here because tlb flushing would be guaranteed along the way, but I'm just a bit worried it's not the best way to go.. Thanks, -- Peter Xu