From: Nadav Amit <namit@xxxxxxxxxx>

There are currently (at least?) 5 different TLB batching schemes in the
kernel:

1. Using mmu_gather (e.g., zap_page_range()).

2. Using {inc|dec}_tlb_flush_pending() to inform other threads of the
   ongoing deferred TLB flush, and eventually flushing the entire range
   (e.g., change_protection_range()).

3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).

4. Batching per-table flushes (move_ptes()).

5. Setting a flag to indicate that a deferred TLB flush is pending, and
   flushing later (try_to_unmap_one() on x86).

It seems that (1)-(4) can be consolidated. In addition, it seems that
(5) is racy. It also seems there can be many redundant TLB flushes, and
potentially TLB-shootdown storms, for instance during batched
reclamation (using try_to_unmap_one()) if at the same time mmu_gather
defers TLB flushes.

More aggressive TLB batching may be possible, but this patch-set does
not add such batching. The proposed changes would enable such batching
at a later time.

Admittedly, I do not understand how things are not broken today, which
makes me wary of adding further batching before getting things in
order. For instance, why is it ok for zap_pte_range() to batch dirty-PTE
flushes for each page-table (but not at greater granularity)? Can't
ClearPageDirty() be called before the flush, causing writes after
ClearPageDirty() and before the flush to be lost?

This patch-set therefore performs the following changes:

1. Changes mprotect, task_mmu and mapping_dirty_helpers to use
   mmu_gather instead of {inc|dec}_tlb_flush_pending().

2. Avoids TLB flushes if PTE permission is not demoted.

3. Cleans up mmu_gather to be less arch-dependent.

4. Uses mm's generations to track at finer granularity, either per-VMA
   or per page-table, whether a pending mmu_gather operation is
   outstanding. This should allow avoiding some TLB flushes when KSM or
   memory reclamation takes place while another operation such as
   munmap() or mprotect() is running.

5. Changes the try_to_unmap_one() flushing scheme, as the current one
   seems broken, to track in a bitmap which CPUs have outstanding TLB
   flushes instead of using a flag.

Further optimizations are possible, such as changing move_ptes() to use
mmu_gather.

The patches were very very lightly tested. I am looking forward to your
feedback regarding the overall approaches, and whether to split them
into multiple patch-sets.

Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: linux-csky@xxxxxxxxxxxxxxx
Cc: linuxppc-dev@xxxxxxxxxxxxxxxx
Cc: linux-s390@xxxxxxxxxxxxxxx
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Nick Piggin <npiggin@xxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: x86@xxxxxxxxxx
Cc: Yu Zhao <yuzhao@xxxxxxxxxx>

Nadav Amit (20):
  mm/tlb: fix fullmm semantics
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion
  mm/mapping_dirty_helpers: use mmu_gather
  mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h
  fs/task_mmu: use mmu_gather interface of clear-soft-dirty
  mm: move x86 tlb_gen to generic code
  mm: store completed TLB generation
  mm: create pte/pmd_tlb_flush_pending()
  mm: add pte_to_page()
  mm/tlb: remove arch-specific tlb_start/end_vma()
  mm/tlb: save the VMA that is flushed during tlb_start_vma()
  mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  mm: move inc/dec_tlb_flush_pending() to mmu_gather.c
  mm: detect deferred TLB flushes in vma granularity
  mm/tlb: per-page table generation tracking
  mm/tlb: updated completed deferred TLB flush conditionally
  mm: make mm_cpumask() volatile
  lib/cpumask: introduce cpumask_atomic_or()
  mm/rmap: avoid potential races

 arch/arm/include/asm/bitops.h         |   4 +-
 arch/arm/include/asm/pgtable.h        |   4 +-
 arch/arm64/include/asm/pgtable.h      |   4 +-
 arch/csky/Kconfig                     |   1 +
 arch/csky/include/asm/tlb.h           |  12 --
 arch/powerpc/Kconfig                  |   1 +
 arch/powerpc/include/asm/tlb.h        |   2 -
 arch/s390/Kconfig                     |   1 +
 arch/s390/include/asm/tlb.h           |   3 -
 arch/sparc/Kconfig                    |   1 +
 arch/sparc/include/asm/pgtable_64.h   |   9 +-
 arch/sparc/include/asm/tlb_64.h       |   2 -
 arch/sparc/mm/init_64.c               |   2 +-
 arch/x86/Kconfig                      |   3 +
 arch/x86/hyperv/mmu.c                 |   2 +-
 arch/x86/include/asm/mmu.h            |  10 -
 arch/x86/include/asm/mmu_context.h    |   1 -
 arch/x86/include/asm/paravirt_types.h |   2 +-
 arch/x86/include/asm/pgtable.h        |  24 +--
 arch/x86/include/asm/tlb.h            |  21 +-
 arch/x86/include/asm/tlbbatch.h       |  15 --
 arch/x86/include/asm/tlbflush.h       |  61 ++++--
 arch/x86/mm/tlb.c                     |  52 +++--
 arch/x86/xen/mmu_pv.c                 |   2 +-
 drivers/firmware/efi/efi.c            |   1 +
 fs/proc/task_mmu.c                    |  29 ++-
 include/asm-generic/bitops/find.h     |   8 +-
 include/asm-generic/tlb.h             | 291 +++++++++++++++++++++-----
 include/linux/bitmap.h                |  21 +-
 include/linux/cpumask.h               |  40 ++--
 include/linux/huge_mm.h               |   3 +-
 include/linux/mm.h                    |  29 ++-
 include/linux/mm_types.h              | 166 ++++++++++-----
 include/linux/mm_types_task.h         |  13 --
 include/linux/pgtable.h               |   2 +-
 include/linux/smp.h                   |   6 +-
 init/Kconfig                          |  21 ++
 kernel/fork.c                         |   2 +
 kernel/smp.c                          |   8 +-
 lib/bitmap.c                          |  33 ++-
 lib/cpumask.c                         |   8 +-
 lib/find_bit.c                        |  10 +-
 mm/huge_memory.c                      |   6 +-
 mm/init-mm.c                          |   1 +
 mm/internal.h                         |  16 --
 mm/ksm.c                              |   2 +-
 mm/madvise.c                          |   6 +-
 mm/mapping_dirty_helpers.c            |  52 +++--
 mm/memory.c                           |   2 +
 mm/mmap.c                             |   1 +
 mm/mmu_gather.c                       |  59 +++++-
 mm/mprotect.c                         |  55 ++---
 mm/mremap.c                           |   2 +-
 mm/pgtable-generic.c                  |   2 +-
 mm/rmap.c                             |  42 ++--
 mm/vmscan.c                           |   1 +
 56 files changed, 803 insertions(+), 374 deletions(-)
 delete mode 100644 arch/x86/include/asm/tlbbatch.h

-- 
2.25.1