From: Nadav Amit <namit@xxxxxxxxxx>

This patch-set is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set makes it through, similar and
further optimizations for MADV_COLD and userfaultfd would be possible.

Basically, there are 3 optimizations in this patch-set:

1. Use TLB batching infrastructure to batch flushes across VMAs and do
   better/fewer flushes. This would also be handy for later userfaultfd
   enhancements.

2. Avoid unnecessary TLB flushes. This optimization is the one that
   provides most of the performance benefits. Unlike previous versions,
   we now only avoid flushes that would not result in spurious
   page-faults.

3. Avoid TLB flushes on change_huge_pmd() that are only needed to
   prevent the A/D bits from changing.

Andrew asked for some benchmark numbers. I do not have an easy,
deterministic macrobenchmark in which it is easy to show the benefit. I
therefore ran a microbenchmark: a loop that does the following on
anonymous memory, just as a sanity check to see that time is saved by
avoiding TLB flushes. The loop goes:

	mprotect(p, PAGE_SIZE, PROT_READ)
	mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
	*p = 0; // make the page writable

The test was run in a KVM guest with 1 or 2 threads (the second thread
was busy-looping). I measured the time (cycles) of each operation:

                         1 thread                    2 threads
                  mmots   +patch              mmots   +patch
PROT_READ          3494    2725 (-22%)         8630    7788 (-10%)
PROT_READ|WRITE    3952    2724 (-31%)         9075    2865 (-68%)

[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]

The exact numbers are really meaningless, but the benefit is clear.
There are 2 interesting results, though.

(1) PROT_READ is cheaper, while one might expect it not to be affected.
This is presumably due to a TLB miss that is saved.

(2) Without the memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result, both operations on the patched kernel take roughly 1500
cycles (with either 1 or 2 threads), whereas on mmots their cost is as
high as presented in the table.

Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Peter Xu <peterx@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Yu Zhao <yuzhao@xxxxxxxxxx>
Cc: Nick Piggin <npiggin@xxxxxxxxx>
Cc: x86@xxxxxxxxxx

--
v5 -> v6:
* Wrong patch 2 was sent in v5

v4 -> v5:
* Avoid only TLB flushes that would not result in spurious PF [Dave]
* Better comments, names in pte_flags_need_flush() [Dave]

v3 -> v4:
* Remove KNL-related stuff [Dave]
* Check error code sanity on every PF [Dave]
* Reduce nesting, simplify access_error() changes [Dave]
* Remove redundant present->non-present check
* Use break instead of goto in do_mprotect_pkey()
* Add missing change_prot_numa() chunk

v2 -> v3:
* Fix order of patches (the previous order could lead to breakage)
* Better comments
* Clearer KNL detection [Dave]
* Assertion on PF error-code [Dave]
* Comments, code, function names improvements [PeterZ]
* Flush on access-bit clearing on PMD changes to follow the way
  flushing on x86 is done today in the kernel.

v1 -> v2:
* Wrong detection of permission demotion [Andrea]
* Better comments [Andrea]
* Handle THP [Andrea]
* Batching across VMAs [Peter Xu]
* Avoid open-coding PTE analysis
* Fix wrong use of the mmu_gather()

Nadav Amit (3):
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush when not required architecturally
  mm: avoid unnecessary flush on change_huge_pmd()

 arch/x86/include/asm/pgtable.h       |   5 ++
 arch/x86/include/asm/pgtable_types.h |   2 +
 arch/x86/include/asm/tlbflush.h      | 121 +++++++++++++++++++++++++++
 arch/x86/mm/pgtable.c                |  10 +++
 fs/exec.c                            |   6 +-
 include/asm-generic/tlb.h            |  14 ++++
 include/linux/huge_mm.h              |   5 +-
 include/linux/mm.h                   |   5 +-
 include/linux/pgtable.h              |  20 +++++
 mm/huge_memory.c                     |  19 +++--
 mm/mempolicy.c                       |   9 +-
 mm/mprotect.c                        |  93 ++++++++++----------
 mm/pgtable-generic.c                 |   8 ++
 mm/userfaultfd.c                     |   6 +-
 14 files changed, 268 insertions(+), 55 deletions(-)

-- 
2.25.1