> On Dec 21, 2020, at 11:55 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Mon, Dec 21, 2020 at 11:16 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote: >> Nadav Amit found memory corruptions when running userfaultfd test above. >> It seems to me the problem is related to commit 09854ba94c6a ("mm: >> do_wp_page() simplification"). Can you please take a look? Thanks. >> >> TL;DR: it may not safe to make copies of singly mapped (non-COW) pages >> when it's locked or has additional ref count because concurrent >> clear_soft_dirty or change_pte_range may have removed pte_write but yet >> to flush tlb. > > Hmm. The TLB flush shouldn't actually matter, because anything that > changes the writable bit had better be serialized by the page table > lock. > > Yes, we often load the page table value without holding the page table > lock (in order to know what we are going to do), but then before we > finalize the operation, we then re-check - undet the page table lock - > that the value we loaded still matches. > > But I think I see what *MAY* be going on. The userfaultfd > mwriteprotect_range() code takes the mm lock for _reading_. Which > means that you can have > > Thread A Thread B > > - fault starts. Sees write-protected pte, allocates memory, copies data > > - userfaultfd makes the regions writable > > - usefaultfd case writes to the region > > - userfaultfd makes region non-writable > > - fault continues, gets the page table lock, sees that the pte is the > same, uses old copied data > > But if this is what's happening, I think it's a userfaultfd bug. I > think the mmap_read_lock(dst_mm) in mwriteprotect_range() needs to be > a mmap_write_lock(). > > mprotect() does this right, it looks like userfaultfd does not. You > cannot just change the writability of a page willy-nilly without the > correct locking. > > Maybe there are other causes, but this one stands out to me as one > possible cause. > > Comments? Using mmap_write_lock() was my initial fix and there was a strong pushback on this approach due to its potential impact on performance. So I think there are 4 possible directions for solutions that I thought of or others have raised/hinted: 1. mmap_write_lock() 2. Copy the page in cow_user_page() while holding the PTL and after flushing has been done. I am not sure if there are potential problems with special-pages (2 flushes might be necessary for special pages). 3. Always reuse the page and never COW on userfaultfd/soft-dirty. I do not know if it is feasible. 4. Retry the page-fault if mm->tlb_flush_pending is set. Hopefully I did not miss any other suggestion.