On 9/20/19 1:12 PM, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
> chunk of memory (496GB) mapped with THP, it takes an average of 275
> seconds, which can cause a lot of problems for the load (in the qemu
> case, the guest will lock up for this time).
>
> Trying to find the source of this bug, I found out most of this time is
> spent on serialize_against_pte_lookup(). This function will take a lot
> of time in smp_call_function_many() if there is more than a couple of
> CPUs running the user process. Since it has to happen for every THP
> mapped, it will take a very long time for large amounts of memory.
>
> By the docs, serialize_against_pte_lookup() is needed in order to avoid
> pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
> pagetable walk, happening concurrently with THP splitting/collapsing.
>
> It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[],
> after interrupts are re-enabled.
> Since interrupts are (usually) disabled during a lockless pagetable
> walk, and serialize_against_pte_lookup() will only return after
> interrupts are enabled, it is protected.
>
> So, by what I could understand, if there is no lockless pagetable walk
> running, there is no need to call serialize_against_pte_lookup().
>
> So, to avoid the cost of running serialize_against_pte_lookup(), I
> propose a counter that keeps track of how many find_current_mm_pte()
> are currently running, and if there is none, just skip
> smp_call_function_many().

Just noticed that this really should also include linux-mm; maybe it's
best to repost the patchset with them included? In particular, there is
likely to be some feedback about adding more calls, in addition to
local_irq_disable/enable, around the gup_fast() path, separately from my
questions about the synchronization cases in ppc.
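For readers following along, the counting scheme the cover letter describes can be modeled in plain userspace C roughly like this. This is a sketch only: the function names match the cover letter, but the struct and the atomic bodies are illustrative stand-ins, not the actual kernel implementation (which hangs the counter off the real mm_struct).

```c
#include <stdatomic.h>

/* Illustrative stand-in for the proposed per-mm counter field. */
struct mm_struct {
	atomic_ulong lockless_pgtbl_walk_count;
};

/* Insert before starting any lockless pagetable walk. */
static void start_lockless_pgtbl_walk(struct mm_struct *mm)
{
	atomic_fetch_add(&mm->lockless_pgtbl_walk_count, 1);
}

/* Insert after the end of the walk (after the ptep is last used). */
static void end_lockless_pgtbl_walk(struct mm_struct *mm)
{
	atomic_fetch_sub(&mm->lockless_pgtbl_walk_count, 1);
}

/* Returns the number of lockless pagetable walks currently running. */
static unsigned long running_lockless_pgtbl_walk(struct mm_struct *mm)
{
	return atomic_load(&mm->lockless_pgtbl_walk_count);
}
```

In the series, the start/end pair would bracket each lockless walker (gup_fast() and the powerpc callers listed below), so a writer can cheaply ask whether any walker is in flight.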
thanks,
-- 
John Hubbard
NVIDIA

> The related functions are:
>
> start_lockless_pgtbl_walk(mm)
> 	Insert before starting any lockless pgtable walk
> end_lockless_pgtbl_walk(mm)
> 	Insert after the end of any lockless pgtable walk
> 	(Mostly after the ptep is last used)
> running_lockless_pgtbl_walk(mm)
> 	Returns the number of lockless pgtable walks running
>
> On my workload (qemu), I could see munmap's time reduction from 275
> seconds to 418ms.
>
>> Leonardo Bras (11):
>>   powerpc/mm: Adds counting method to monitor lockless pgtable walks
>>   asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
>>     walks
>>   mm/gup: Applies counting method to monitor gup_pgd_range
>>   powerpc/mce_power: Applies counting method to monitor lockless pgtbl
>>     walks
>>   powerpc/perf: Applies counting method to monitor lockless pgtbl walks
>>   powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
>>     pgtbl walks
>>   powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
>>     walks
>>   powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
>>     pgtbl walks
>>   powerpc/kvm/book3s_64: Applies counting method to monitor lockless
>>     pgtbl walks
>>   powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
>>     walk
>>   powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
>>
>>  arch/powerpc/include/asm/book3s/64/mmu.h     |  3 +++
>>  arch/powerpc/include/asm/book3s/64/pgtable.h |  5 +++++
>>  arch/powerpc/kernel/mce_power.c              | 13 ++++++++++---
>>  arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 ++
>>  arch/powerpc/kvm/book3s_64_mmu_radix.c       | 20 ++++++++++++++++++--
>>  arch/powerpc/kvm/book3s_64_vio_hv.c          |  4 ++++
>>  arch/powerpc/kvm/book3s_hv_nested.c          |  8 ++++++++
>>  arch/powerpc/kvm/book3s_hv_rm_mmu.c          |  9 ++++++++-
>>  arch/powerpc/kvm/e500_mmu_host.c             |  4 ++++
>>  arch/powerpc/mm/book3s64/hash_tlb.c          |  2 ++
>>  arch/powerpc/mm/book3s64/hash_utils.c        |  7 +++++++
>>  arch/powerpc/mm/book3s64/mmu_context.c       |  1 +
>>  arch/powerpc/mm/book3s64/pgtable.c           | 20 +++++++++++++++++++-
>>  arch/powerpc/perf/callchain.c                |  5 ++++-
>>  include/asm-generic/pgtable.h                |  9 +++++++++
>>  mm/gup.c                                     |  4 ++++
>>  16 files changed, 108 insertions(+), 8 deletions(-)
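The writer-side optimization the cover letter proposes ("if there is none, just skip smp_call_function_many()") can likewise be modeled in userspace. Everything here is a hypothetical stand-in: the IPI broadcast is replaced by a flag-setting stub so the skip logic can be exercised, and the real serialize_against_pte_lookup() in arch/powerpc/mm/book3s64/pgtable.c looks different in detail.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the proposed per-mm counter (see sketch above
 * in spirit; redefined here so this fragment is self-contained). */
struct mm_struct {
	atomic_ulong lockless_pgtbl_walk_count;
};

static unsigned long running_lockless_pgtbl_walk(struct mm_struct *mm)
{
	return atomic_load(&mm->lockless_pgtbl_walk_count);
}

/* Stand-in for the expensive smp_call_function_many() IPI broadcast;
 * records whether it was actually issued. */
static bool ipi_broadcast_ran;
static void broadcast_do_nothing(void)
{
	ipi_broadcast_ran = true;
}

/*
 * Hypothetical shape of the final patch in the series: pay for the
 * cross-CPU serialization only when a lockless walk is in flight.
 */
static void serialize_against_pte_lookup(struct mm_struct *mm)
{
	if (running_lockless_pgtbl_walk(mm) == 0)
		return;	/* no walker can hold a stale pmd_t: skip the IPI */
	broadcast_do_nothing();
}
```

This models why munmap() of THP-backed memory speeds up so dramatically: in the common case the counter is zero and the per-PMD IPI broadcast is skipped entirely. (Whether the check and a concurrent counter increment can race is exactly the kind of synchronization question raised above about the ppc cases.)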