On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote: > From: Barry Song <v-songbaohua@xxxxxxxx> > > on x86, batched and deferred tlb shootdown has lead to 90% > performance increase on tlb shootdown. on arm64, HW can do > tlb shootdown without software IPI. But sync tlbi is still > quite expensive. [...] > .../features/vm/TLB/arch-support.txt | 2 +- > arch/arm64/Kconfig | 1 + > arch/arm64/include/asm/tlbbatch.h | 12 ++++ > arch/arm64/include/asm/tlbflush.h | 33 ++++++++- > arch/arm64/mm/flush.c | 69 +++++++++++++++++++ > arch/x86/include/asm/tlbflush.h | 5 +- > include/linux/mm_types_task.h | 4 +- > mm/rmap.c | 12 ++-- First of all, this patch needs to be split in some preparatory patches introducing/renaming functions with no functional change for x86. Once done, you can add the arm64-only changes. Now, on the implementation, I had some comments on v7 but we didn't get to a conclusion and the thread eventually died: https://lore.kernel.org/linux-mm/Y7cToj5mWd1ZbMyQ@xxxxxxx/ I know I said a command line argument is better than Kconfig or some random number of CPUs heuristics but it would be even better if we don't bother with any, just make this always on. Barry had some comments around mprotect() being racy and that's why we have flush_tlb_batched_pending() but I don't think it's needed (or, for arm64, it can be a DSB since this patch issues the TLBIs but without the DVM Sync). So we need to clarify this (see Barry's last email on the above thread) and before attempting new versions of this patchset. With flush_tlb_batched_pending() removed (or DSB), I have a suspicion such implementation would be faster on any SoC irrespective of the number of CPUs. -- Catalin