> On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>
> On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote:
>>> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>>>
>>> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
>>>> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking")
>>>
>>> Targeting a backport down to 2013, when nothing could go wrong in practice
>>> with page_mapcount, sounds backwards and unnecessarily risky.
>>>
>>> In theory it was already broken and in theory
>>> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the
>>> previous code of 2013 is completely wrong, but in practice the code
>>> from 2013 worked perfectly until Aug 21 2020.
>>
>> Well… If you consider the bug that Will recently fixed [1], then soft-dirty
>> was broken (for a different, yet related reason) since 0758cd830494
>> ("asm-generic/tlb: avoid potential double flush").
>>
>> This is not to say that I argue that the patch should be backported to 2013,
>> just to say that memory corruption bugs can go unnoticed.
>>
>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@xxxxxxxxxx/
>
> Is this a fix or a cleanup?
>
> The above is precisely what I said earlier: that tlb_gather had no
> reason to stay in clear_refs and it had to use inc_tlb_flush_pending
> as mprotect does. But it's not a fix, is it? I suggested it as a pure
> cleanup. So again no backport required. The commit says "fix this" but
> it means "clean this up".

It is actually a fix. I think the commit log is not entirely correct and
should include:

  Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush")

Since 0758cd830494, calling tlb_finish_mmu() without any previous call to
pte_free_tlb() and friends does not flush the TLB. The soft-dirty bug
producer that I sent fails without this patch of Will's.
> So setting a Fixes back to 2013, which would go mess with all stable
> trees by actively backporting a performance regression to clear_refs
> that can break runtime performance, to fix a philosophical issue that
> isn't even a theoretical issue, doesn't sound ideal to me.

Point taken.

>> To summarize my action items based on your (and others') feedback on both
>> patches:
>>
>> 1. I will break the first patch into two different patches, one with the
>> "optimization" for write-unprotect, based on your feedback. It will not
>> be backported.
>>
>> 2. I will try to add a patch to avoid TLB flushes on
>> userfaultfd-writeunprotect. It will also not be backported.
>
> I think 1 and 2 above could be in the same patch. Mixing an uffd-wp
> optimization with the actual fix for the memory corruption wasn't
> ideal, but doing the same optimization to both wrprotect and
> un-wrprotect in the same patch sounds ideal. The commit explanation
> would be identical and it can be de-duplicated this way.
>
> I'd suggest to coordinate with Peter on that, since I wasn't planning
> to work on this if somebody else offered to do it.
>
>> 3. Let me know if you want me to use your version of testing
>> mm_tlb_flush_pending() and conditionally flushing, wait for a new
>> version from you or Peter, or to go with taking mmap_lock for write.
>
> Yes, as you suggested, I'm trying to clean it up and send a new
> version.
>
> Ultimately my view is there are a huge number of cases where
> mmap_write_lock, or some other heavy lock that will occasionally have
> to block on I/O, is beyond impossible not to take. Even speculative
> page faults only attack the low-hanging anon memory, and there's still
> MADV_DONTNEED/FREE and other stuff that may have to run in parallel
> with UFFDIO_WRITEPROTECT and clear_refs, not just page faults.
>
> As a reminder: the only case when modifying the vmas is allowed under
> mmap_read_lock (I already tried once to make it safer by adding
> READ_ONCE/WRITE_ONCE, but it wasn't merged, see
> https://www.spinics.net/lists/linux-mm/msg173420.html) is when
> updating vm_end/vm_start in growsdown/up, where the vma is extended
> down or up in the page fault under only mmap_read_lock.
>
> I'm doing all I can to document and make more explicit the
> complexity we deal with in the code (as well as reducing the gcc
> dependency in emitting atomic writes to update vm_end/vm_start, as we
> should do in ptes as well in theory). As you may notice in the
> feedback from the above submission, not everyone even realized that
> we're modifying vmas already under mmap_read_lock. So it'd be great to
> get help to merge that READ_ONCE/WRITE_ONCE cleanup that is still
> valid and pending for merge, but it needs forward porting.
>
> This one, for both soft dirty and uffd_wrprotect, is a walk in the
> park to optimize in comparison to the vma modifications.

I am sure you are right.

> From my point of view in fact, doing the tlb flush or the wait on the
> atomic to be released does not increase kernel complexity compared to
> what we had until now.

It is also about performance due to unwarranted TLB flushes. I think
avoiding them requires some finer-granularity detection of pending
page-faults. But anyhow, I still owe some TLB optimization patches (and
v2 for userfaultfd+iouring) before I can even look at that.

In addition, as I stated before, having some clean interfaces that tell
whether a TLB flush is needed or not would be helpful and simpler to
follow. For instance, we could have is_pte_prot_demotion(oldprot, newprot)
to figure out whether a TLB flush is needed in change_pte_range(), and
avoid unnecessary flushes when unprotecting pages with either mprotect()
or userfaultfd.