On Mon, May 31, 2021 at 5:53 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, May 31, 2021 at 05:40:09PM -0700, akpm@xxxxxxxxxxxxxxxxxxxx wrote:
> >
> > The patch titled
> >      Subject: mm: improve mprotect(R|W) efficiency on pages referenced once
> > has been added to the -mm tree.  Its filename is
> >      mm-improve-mprotectrw-efficiency-on-pages-referenced-once.patch
> >
> > This patch should soon appear at
> >      https://ozlabs.org/~akpm/mmots/broken-out/mm-improve-mprotectrw-efficiency-on-pages-referenced-once.patch
> > and later at
> >      https://ozlabs.org/~akpm/mmotm/broken-out/mm-improve-mprotectrw-efficiency-on-pages-referenced-once.patch
> >
> > Before you just go and hit "reply", please:
> >    a) Consider who else should be cc'ed
> >    b) Prefer to cc a suitable mailing list as well
> >    c) Ideally: find the original patch on the mailing list and do a
> >       reply-to-all to that, adding suitable additional cc's
> >
> > *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
> >
> > The -mm tree is included into linux-next and is updated
> > there every 3-4 working days
> >
> > ------------------------------------------------------
> > From: Peter Collingbourne <pcc@xxxxxxxxxx>
> > Subject: mm: improve mprotect(R|W) efficiency on pages referenced once
> >
> > In the Scudo memory allocator [1] we would like to be able to detect
> > use-after-free vulnerabilities involving large allocations by issuing
> > mprotect(PROT_NONE) on the memory region used for the allocation when it
> > is deallocated.  Later on, after the memory region has been "quarantined"
> > for a sufficient period of time we would like to be able to use it for
> > another allocation by issuing mprotect(PROT_READ|PROT_WRITE).
> >
> > Before this patch, after removing the write protection, any writes to the
> > memory region would result in page faults and entering the copy-on-write
> > code path, even in the usual case where the pages are only referenced by a
> > single PTE, harming performance unnecessarily.  Make it so that any pages
> > in anonymous mappings that are only referenced by a single PTE are
> > immediately made writable during the mprotect so that we can avoid the
> > page faults.
> >
> > This program shows the critical syscall sequence that we intend to use in
> > the allocator:
> >
> >   #include <string.h>
> >   #include <sys/mman.h>
> >
> >   enum { kSize = 131072 };
> >
> >   int main(int argc, char **argv) {
> >     char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
> >                               MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >     for (int i = 0; i != 100000; ++i) {
> >       memset(addr, i, kSize);
> >       mprotect((void *)addr, kSize, PROT_NONE);
> >       mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
> >     }
> >   }
> >
> > The effect of this patch on the above program was measured on a
> > DragonBoard 845c by taking the median real time execution time of 10 runs.
> >
> > Before: 2.94s
> > After:  0.66s
> >
> > The effect was also measured using one of the microbenchmarks that we
> > normally use to benchmark the allocator [2], after modifying it to make
> > the appropriate mprotect calls [3].  With an allocation size of 131072
> > bytes to trigger the allocator's "large allocation" code path the
> > per-iteration time was measured as follows:
> >
> > Before: 27450ns
> > After:   6010ns
> >
> > This patch means that we do more work during the mprotect call itself in
> > exchange for less work when the pages are accessed.  In the worst case,
> > the pages are not accessed at all.
> > The effect of this patch in such
> > cases was measured using the following program:
> >
> >   #include <string.h>
> >   #include <sys/mman.h>
> >
> >   enum { kSize = 131072 };
> >
> >   int main(int argc, char **argv) {
> >     char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
> >                               MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >     memset(addr, 1, kSize);
> >     for (int i = 0; i != 100000; ++i) {
> > #ifdef PAGE_FAULT
> >       memset(addr + (i * 4096) % kSize, i, 4096);
> > #endif
> >       mprotect((void *)addr, kSize, PROT_NONE);
> >       mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
> >     }
> >   }
> >
> > With PAGE_FAULT undefined (0 pages touched after removing write
> > protection) the median real time execution time of 100 runs was measured
> > as follows:
> >
> > Before: 0.330260s
> > After:  0.338836s
> >
> > With PAGE_FAULT defined (1 page touched) the measurements were as
> > follows:
> >
> > Before: 0.438048s
> > After:  0.355661s
> >
> > So it seems that even with a single page fault the new approach is faster.
> >
> > I saw similar results if I adjusted the programs to use a larger mapping
> > size.  With kSize = 1048576 I get these numbers with PAGE_FAULT undefined:
> >
> > Before: 1.428988s
> > After:  1.512016s
> >
> > i.e. the new approach is around 5.5% slower.
> >
> > And these with PAGE_FAULT defined:
> >
> > Before: 1.518559s
> > After:  1.524417s
> >
> > i.e. about the same.
> >
> > What I think we may conclude from these results is that for smaller
> > mappings the advantage of the previous approach, although measurable, is
> > wiped out by a single page fault.  I think we may expect that there should
> > be at least one access resulting in a page fault (under the previous
> > approach) after making the pages writable, since the program presumably
> > made the pages writable for a reason.
> >
> > For larger mappings we may guesstimate that the new approach wins if the
> > density of future page faults is > 0.4%.  But for the mappings that are
> > large enough for density to matter (not just the absolute number of page
> > faults) it doesn't seem like the increase in mprotect latency would be
> > very large relative to the total mprotect execution time.
> >
> > Link: https://lkml.kernel.org/r/20210527190453.1259020-1-pcc@xxxxxxxxxx
> > Link: https://linux-review.googlesource.com/id/I98d75ef90e20330c578871c87494d64b1df3f1b8
> > Link: [1] https://source.android.com/devices/tech/debug/scudo
> > Link: [2] https://cs.android.com/android/platform/superproject/+/master:bionic/benchmarks/stdlib_benchmark.cpp;l=53;drc=e8693e78711e8f45ccd2b610e4dbe0b94d551cc9
> > Link: [3] https://github.com/pcc/llvm-project/commit/scudo-mprotect-secondary2
> > Signed-off-by: Peter Collingbourne <pcc@xxxxxxxxxx>
> > Reviewed-by: Peter Xu <peterx@xxxxxxxxxx>
> > Cc: Kostya Kortchinsky <kostyak@xxxxxxxxxx>
> > Cc: Evgenii Stepanov <eugenis@xxxxxxxxxx>
> > Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> > Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > ---
> >
> >  mm/mprotect.c |   29 ++++++++++++++++++++++++-----
> >  1 file changed, 24 insertions(+), 5 deletions(-)
> >
> > --- a/mm/mprotect.c~mm-improve-mprotectrw-efficiency-on-pages-referenced-once
> > +++ a/mm/mprotect.c
> > @@ -35,6 +35,29 @@
> >
> >  #include "internal.h"
> >
> > +static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
> > +                                  unsigned long cp_flags)
> > +{
>
> Andrew,
>
> As raised in the original patch, IMHO we'd better add below:
>
>         if (cp_flags & MM_CP_PROT_NUMA)
>                 return false;
>
> Peter Collingbourne, could you confirm?
Done in v5, and added some comments as requested by Andrew on the other
thread.

Peter
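For anyone skimming the thread, the check Peter Xu asks for is simply an early
return at the top of the new helper.  The sketch below only illustrates the
behaviour described in the changelog (make anonymous pages mapped by a single
PTE writable, but never for NUMA-balancing protection updates); apart from the
quoted function signature and the suggested MM_CP_PROT_NUMA guard, every
condition here is an assumption for illustration, not the actual v5 body:

  static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
                                    unsigned long cp_flags)
  {
          /*
           * Suggested guard: NUMA-balancing protection changes
           * (MM_CP_PROT_NUMA) must never produce writable PTEs.
           */
          if (cp_flags & MM_CP_PROT_NUMA)
                  return false;

          /*
           * Illustrative check: only private anonymous mappings that are
           * actually gaining write permission are candidates.
           */
          if (!(vma->vm_flags & VM_WRITE) || !vma_is_anonymous(vma))
                  return false;

          /*
           * Illustrative check: the page must be referenced by this PTE
           * alone, otherwise the copy-on-write fault is still required.
           */
          return page_count(pte_page(pte)) == 1;
  }

Presumably the helper is consulted from the PTE-walking loop in mm/mprotect.c
when deciding whether the new protection can include write access immediately;
see the actual patch for the real call site and conditions.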