On Mon, May 31, 2021 at 05:40:09PM -0700, akpm@xxxxxxxxxxxxxxxxxxxx wrote:
> 
> The patch titled
>      Subject: mm: improve mprotect(R|W) efficiency on pages referenced once
> has been added to the -mm tree.  Its filename is
>      mm-improve-mprotectrw-efficiency-on-pages-referenced-once.patch
> 
> This patch should soon appear at
>     https://ozlabs.org/~akpm/mmots/broken-out/mm-improve-mprotectrw-efficiency-on-pages-referenced-once.patch
> and later at
>     https://ozlabs.org/~akpm/mmotm/broken-out/mm-improve-mprotectrw-efficiency-on-pages-referenced-once.patch
> 
> Before you just go and hit "reply", please:
>    a) Consider who else should be cc'ed
>    b) Prefer to cc a suitable mailing list as well
>    c) Ideally: find the original patch on the mailing list and do a
>       reply-to-all to that, adding suitable additional cc's
> 
> *** Remember to use Documentation/process/submit-checklist.rst when
>     testing your code ***
> 
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
> 
> ------------------------------------------------------
> From: Peter Collingbourne <pcc@xxxxxxxxxx>
> Subject: mm: improve mprotect(R|W) efficiency on pages referenced once
> 
> In the Scudo memory allocator [1] we would like to be able to detect
> use-after-free vulnerabilities involving large allocations by issuing
> mprotect(PROT_NONE) on the memory region used for the allocation when it
> is deallocated.  Later on, after the memory region has been "quarantined"
> for a sufficient period of time, we would like to be able to use it for
> another allocation by issuing mprotect(PROT_READ|PROT_WRITE).
> 
> Before this patch, after removing the write protection, any writes to the
> memory region would result in page faults and entering the copy-on-write
> code path, even in the usual case where the pages are only referenced by
> a single PTE, harming performance unnecessarily.  Make it so that any
> pages in anonymous mappings that are only referenced by a single PTE are
> immediately made writable during the mprotect so that we can avoid the
> page faults.
> 
> This program shows the critical syscall sequence that we intend to use
> in the allocator:
> 
>   #include <string.h>
>   #include <sys/mman.h>
> 
>   enum { kSize = 131072 };
> 
>   int main(int argc, char **argv) {
>     char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
>                               MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>     for (int i = 0; i != 100000; ++i) {
>       memset(addr, i, kSize);
>       mprotect((void *)addr, kSize, PROT_NONE);
>       mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
>     }
>   }
> 
> The effect of this patch on the above program was measured on a
> DragonBoard 845c by taking the median real (wall-clock) execution time
> of 10 runs.
> 
> Before: 2.94s
> After:  0.66s
> 
> The effect was also measured using one of the microbenchmarks that we
> normally use to benchmark the allocator [2], after modifying it to make
> the appropriate mprotect calls [3].  With an allocation size of 131072
> bytes, to trigger the allocator's "large allocation" code path, the
> per-iteration time was measured as follows:
> 
> Before: 27450ns
> After:   6010ns
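
(Side note, in case anyone wants to reproduce the wall-clock numbers
above: a self-contained harness along the lines below should be enough.
The clock_gettime() timing scaffold is my addition, not part of the
patch, so take it as an untested sketch.)

  /* Untested sketch: wall-clock timing for the syscall sequence from
   * the changelog.  Build: cc -O2 time_mprotect.c -o time_mprotect
   * The loop body mirrors the program quoted above; only the timing
   * scaffolding is new. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <time.h>

  enum { kSize = 131072 };

  int main(void) {
    char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i != 100000; ++i) {
      memset(addr, i, kSize);
      mprotect(addr, kSize, PROT_NONE);
      mprotect(addr, kSize, PROT_READ | PROT_WRITE);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%.6fs\n", (t1.tv_sec - t0.tv_sec) +
                      (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
  }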
> This patch means that we do more work during the mprotect call itself
> in exchange for less work when the pages are accessed.  In the worst
> case, the pages are not accessed at all.  The effect of this patch in
> such cases was measured using the following program:
> 
>   #include <string.h>
>   #include <sys/mman.h>
> 
>   enum { kSize = 131072 };
> 
>   int main(int argc, char **argv) {
>     char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
>                               MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>     memset(addr, 1, kSize);
>     for (int i = 0; i != 100000; ++i) {
>   #ifdef PAGE_FAULT
>       memset(addr + (i * 4096) % kSize, i, 4096);
>   #endif
>       mprotect((void *)addr, kSize, PROT_NONE);
>       mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
>     }
>   }
> 
> With PAGE_FAULT undefined (0 pages touched after removing write
> protection) the median real execution time of 100 runs was measured
> as follows:
> 
> Before: 0.330260s
> After:  0.338836s
> 
> With PAGE_FAULT defined (1 page touched) the measurements were
> as follows:
> 
> Before: 0.438048s
> After:  0.355661s
> 
> So it seems that even with a single page fault the new approach is
> faster.
> 
> I saw similar results if I adjusted the programs to use a larger
> mapping size.  With kSize = 1048576 I get these numbers with
> PAGE_FAULT undefined:
> 
> Before: 1.428988s
> After:  1.512016s
> 
> i.e. around 5.5% slower.
> 
> And these with PAGE_FAULT defined:
> 
> Before: 1.518559s
> After:  1.524417s
> 
> i.e. about the same.
> 
> What I think we may conclude from these results is that for smaller
> mappings the advantage of the previous approach, although measurable,
> is wiped out by a single page fault.  I think we may expect that there
> should be at least one access resulting in a page fault (under the
> previous approach) after making the pages writable, since the program
> presumably made the pages writable for a reason.
> 
> For larger mappings we may guesstimate that the new approach wins if
> the density of future page faults is > 0.4%.  But for mappings that
> are large enough for density to matter (not just the absolute number
> of page faults) it doesn't seem like the increase in mprotect latency
> would be very large relative to the total mprotect execution time.
> 
> Link: https://lkml.kernel.org/r/20210527190453.1259020-1-pcc@xxxxxxxxxx
> Link: https://linux-review.googlesource.com/id/I98d75ef90e20330c578871c87494d64b1df3f1b8
> Link: [1] https://source.android.com/devices/tech/debug/scudo
> Link: [2] https://cs.android.com/android/platform/superproject/+/master:bionic/benchmarks/stdlib_benchmark.cpp;l=53;drc=e8693e78711e8f45ccd2b610e4dbe0b94d551cc9
> Link: [3] https://github.com/pcc/llvm-project/commit/scudo-mprotect-secondary2
> Signed-off-by: Peter Collingbourne <pcc@xxxxxxxxxx>
> Reviewed-by: Peter Xu <peterx@xxxxxxxxxx>
> Cc: Kostya Kortchinsky <kostyak@xxxxxxxxxx>
> Cc: Evgenii Stepanov <eugenis@xxxxxxxxxx>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
> 
>  mm/mprotect.c |   29 ++++++++++++++++++++++++-----
>  1 file changed, 24 insertions(+), 5 deletions(-)
> 
> --- a/mm/mprotect.c~mm-improve-mprotectrw-efficiency-on-pages-referenced-once
> +++ a/mm/mprotect.c
> @@ -35,6 +35,29 @@
>  
>  #include "internal.h"
>  
> +static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
> +				  unsigned long cp_flags)
> +{

Andrew,

As raised in the original patch, IMHO we'd better add below:

	if (cp_flags & MM_CP_PROT_NUMA)
		return false;

Peter Collingbourne, could you confirm?
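
To be concrete, with the check folded in I'd expect the helper to read
as below (sketch only, untested; placing the bail-out at the very top
of the function is my reading of "add below"):

	/* Sketch: may_avoid_write_fault() from the patch, with the
	 * suggested MM_CP_PROT_NUMA bail-out folded in.  NUMA-balancing
	 * protection updates should never make PTEs writable. */
	static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
					  unsigned long cp_flags)
	{
		if (cp_flags & MM_CP_PROT_NUMA)
			return false;

		if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
			if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
				return false;

			if (page_count(pte_page(pte)) != 1)
				return false;
		}

		if (!pte_dirty(pte))
			return false;

		if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
			return false;

		if (pte_uffd_wp(pte))
			return false;

		return true;
	}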
Thanks,

> +	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
> +		if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
> +			return false;
> +
> +		if (page_count(pte_page(pte)) != 1)
> +			return false;
> +	}
> +
> +	if (!pte_dirty(pte))
> +		return false;
> +
> +	if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
> +		return false;
> +
> +	if (pte_uffd_wp(pte))
> +		return false;
> +
> +	return true;
> +}
> +
>  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long addr, unsigned long end, pgprot_t newprot,
>  		unsigned long cp_flags)
> @@ -43,7 +66,6 @@ static unsigned long change_pte_range(st
>  	spinlock_t *ptl;
>  	unsigned long pages = 0;
>  	int target_node = NUMA_NO_NODE;
> -	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> @@ -132,11 +154,8 @@ static unsigned long change_pte_range(st
>  			}
>  
>  			/* Avoid taking write faults for known dirty pages */
> -			if (dirty_accountable && pte_dirty(ptent) &&
> -					(pte_soft_dirty(ptent) ||
> -					 !(vma->vm_flags & VM_SOFTDIRTY))) {
> +			if (may_avoid_write_fault(ptent, vma, cp_flags))
>  				ptent = pte_mkwrite(ptent);
> -			}
>  			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>  			pages++;
>  		} else if (is_swap_pte(oldpte)) {
> _
> 
> Patches currently in -mm which might be from pcc@xxxxxxxxxx are
> 
> mm-improve-mprotectrw-efficiency-on-pages-referenced-once.patch

-- 
Peter Xu
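
P.S. For anyone wanting to observe the behavioural difference from
userspace: counting minor faults around a single write after the
mprotect() cycle should show the faults disappear on a patched kernel
in the single-reference case (and remain for pages still shared, e.g.
after a fork()).  A sketch of mine, not from the patch, untested:

  /* Untested sketch: report minor faults taken by a write after an
   * mprotect(PROT_NONE) / mprotect(PROT_READ|PROT_WRITE) cycle.
   * Expected: ~32 faults (kSize/4096) before the patch, 0 after. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/resource.h>

  enum { kSize = 131072 };

  static long minflt(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
  }

  int main(void) {
    char *addr = mmap(0, kSize, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    memset(addr, 1, kSize);             /* populate; pages now dirty */
    mprotect(addr, kSize, PROT_NONE);
    mprotect(addr, kSize, PROT_READ | PROT_WRITE);
    long before = minflt();
    memset(addr, 2, kSize);             /* faults only if still write-protected */
    printf("minor faults for the write: %ld\n", minflt() - before);
    return 0;
  }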