> On Dec 20, 2020, at 1:54 AM, Yu Zhao <yuzhao@xxxxxxxxxx> wrote: > > On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote: >>> On Dec 19, 2020, at 10:05 PM, Yu Zhao <yuzhao@xxxxxxxxxx> wrote: >>> >>> On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote: >>>> [ cc’ing some more people who have experience with similar problems ] >>>> >>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote: >>>>> >>>>> Hello, >>>>> >>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote: >>>>>> Analyzing this problem indicates that there is a real bug since >>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might >>>>> >>>>> Never having to take the mmap_sem for writing, and in turn never >>>>> blocking, in order to modify the pagetables is quite an important >>>>> feature in uffd that justifies uffd instead of mprotect. It's not the >>>>> most important reason to use uffd, but it'd be nice if that guarantee >>>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the >>>>> other pgtable manipulations. >>>>> >>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown): >>>>>> >>>>>> cpu0 cpu1 >>>>>> ---- ---- >>>>>> userfaultfd_writeprotect() >>>>>> [ write-protecting ] >>>>>> mwriteprotect_range() >>>>>> mmap_read_lock() >>>>>> change_protection() >>>>>> change_protection_range() >>>>>> ... >>>>>> change_pte_range() >>>>>> [ defer TLB flushes] >>>>>> userfaultfd_writeprotect() >>>>>> mmap_read_lock() >>>>>> change_protection() >>>>>> [ write-unprotect ] >>>>>> ... >>>>>> [ unprotect PTE logically ] >>>>>> ... >>>>>> [ page-fault] >>>>>> ... >>>>>> wp_page_copy() >>>>>> [ set new writable page in PTE] >>> >>> I don't see any problem in this example -- wp_page_copy() calls >>> ptep_clear_flush_notify(), which should take care of the stale entry >>> left by cpu0. >>> >>> That being said, I suspect the memory corruption you observed is >>> related this example, with cpu1 running something else that flushes >>> conditionally depending on pte_write(). >>> >>> Do you know which type of pages were corrupted? file, anon, etc. >> >> First, Yu, you are correct. My analysis is incorrect, but let me have >> another try (below). To answer your (and Andrea’s) question - this happens >> with upstream without any changes, excluding a small fix to the selftest, >> since it failed (got stuck) due to missing wake events. [1] >> >> We are talking about anon memory. >> >> So to correct myself, I think that what I really encountered was actually >> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The >> problem was that in this case the “write”-bit was removed during unprotect. > > Thanks. You are right about when the problem happens: UFD write- > UNprotecting. But it's not UFD write-UNprotecting that removes the > writable bit -- the bit can only be removed during COW or UFD > write-protecting. So your original example was almost correct, except > the last line describing cpu1. The scenario is a bit confusing, so stay with me. The idea behind uffd unprotect is indeed only to mark the PTE logically as uffd-unprotected, and not to *set* the writable bit, allowing the #PF handler to do COW or whatever correctly upon #PF. However, the problem that we have is that if a page is already writable, write-unprotect *clears* the writable bit, making it write-protected (at least for anonymous pages). This is not good from performance point-of-view, but also a correctness issue, as I pointed out. In some more detail: mwriteprotect_range() uses vm_get_page_prot() to compute the new protection. For anonymous private memory, at least on x86, this means the write-bit in the protection is clear. So later, change_pte_range() *clears* the write-bit during *unprotection*. That’s the reason the second part of my patch - the change to preserve_write - fixes the problem. > The problem is how do_wp_page() handles non-COW pages. (For COW pages, > do_wp_page() works correctly by either reusing an existing page or > make a new copy out of it.) In UFD case, the existing page may not > have been properly write-protected. As you pointed out, the tlb flush > may not be done yet. Making a copy can potentially race with the > writer on cpu2. Just to clarify the difference - You regard a scenario of UFFD write-protect, while I am pretty sure the problem I encountered is during write-unprotect. I am not sure we are on the same page (but we may be). The problem I have is with cow_user_page() that is called by do_wp_page() before any TLB flush took place (either by change_protection_range() or by do_wp_page() which does flush, but after the copy). Let me know if you regard a different scenario. > Should we fix the problem by ensuring integrity of the copy? IMO, no, > because do_wp_page() shouldn't copy at all in this case. It seems it > was recently broken by > > be068f29034f mm: fix misplaced unlock_page in do_wp_page() > 09854ba94c6a mm: do_wp_page() simplification > > I haven't study them carefully. But if you could just revert them and > run the test again, we'd know where exactly to look at next. These patches regard the wp_page_reuse() case, which makes me think we are not on the same page. I do not see a problem with wp_page_reuse() since it does not make a copy of the page. If you can explain what I am missing, it would be great.