On Wed, Dec 23, 2020 at 09:00:26PM -0500, Andrea Arcangeli wrote:
> One other near zero cost improvement easy to add if this would be "if
> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made

The next worry then is that if UFFDIO_WRITEPROTECT covers a very large
range, there would be a flood of wrprotect faults, and they'd all end up
issuing a tlb flush during the UFFDIO_WRITEPROTECT itself, which again
is a performance concern for certain uffd-wp use cases.

Those use cases would be for example redis and postcopy live
snapshotting, to use it for async snapshots, unprivileged too in the
case of redis if it temporarily uses bounce buffers for the syscall I/O
for the duration of the snapshot. Hypervisor tuned profiles need to
manually lift unprivileged_userfaultfd to 1, unless their jailer leaves
one capability in the snapshot thread.

Moving the check after userfaultfd_pte_wp would solve
userfaultfd_writeprotect(mode_wp=true), but that still wouldn't avoid a
flood of tlb flushes during userfaultfd_writeprotect(mode_wp=false),
because change_protection doesn't restore the pte_write:

		} else if (uffd_wp_resolve) {
			/*
			 * Leave the write bit to be handled
			 * by PF interrupt handler, then
			 * things like COW could be properly
			 * handled.
			 */
			ptent = pte_clear_uffd_wp(ptent);
		}

When the snapshot is complete, userfaultfd_writeprotect(mode_wp=false)
would need to run again on the whole range, which can be very big again.

Orthogonally, I think we should also look into restoring the pte_write
above in uffd-wp, so it'll get yet an extra boost compared to current
redis snapshotting with fork(), which cannot restore the pte_write after
the snapshot child quits and so forces a flood of spurious wrprotect
faults (uffd-wp can solve that too).
However, even if uffd-wp restored the pte_write, things would remain
suboptimal for a terabyte process under clear_refs, since softdirty
wrprotect faults that start happening while clear_refs is still running
on the mm won't be caught by userfaultfd_pte_wp.

Something like below, if cleaned up, abstracted properly and documented
well in the two places involved, will have a better chance to perform
optimally for softdirty too.

And on a side note, the CONFIG_MEM_SOFT_DIRTY compile time check is
compulsory because VM_SOFTDIRTY is defined to zero if softdirty is not
built in. (For VM_UFFD_WP the CONFIG_HAVE_ARCH_USERFAULTFD_WP check can
be removed and it won't make any measurable difference even when
USERFAULTFD=n.)

RFC untested below; it's supposed to fix the softdirty testcase too,
even without the incremental fix, since clear_refs already does
tlb_gather_mmu before walk_page_range and tlb_finish_mmu after it, and
that appears enough to define the inc/dec_tlb_flush_pending window.

diff --git a/mm/memory.c b/mm/memory.c
index 7d608765932b..66fd6d070c47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2844,11 +2844,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		if (!new_page)
 			goto oom;
 	} else {
+		bool in_uffd_wp, in_softdirty;
 		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 				vmf->address);
 		if (!new_page)
 			goto oom;
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+		in_uffd_wp = !!(vma->vm_flags & VM_UFFD_WP);
+#else
+		in_uffd_wp = false;
+#endif
+#ifdef CONFIG_MEM_SOFT_DIRTY
+		in_softdirty = !(vma->vm_flags & VM_SOFTDIRTY);
+#else
+		in_softdirty = false;
+#endif
+		if ((in_uffd_wp || in_softdirty) &&
+		    mm_tlb_flush_pending(mm))
+			flush_tlb_page(vma, vmf->address);
+
 		if (!cow_user_page(new_page, old_page, vmf)) {
 			/*
 			 * COW failed, if the fault was solved by other,