Anshuman Khandual <anshuman.khandual@xxxxxxx> writes: > On 10/28/22 03:25, Barry Song wrote: >> On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal >> <punit.agrawal@xxxxxxxxxxxxx> wrote: >>> >>> [ Apologies for chiming in late in the conversation ] >>> >>> Anshuman Khandual <anshuman.khandual@xxxxxxx> writes: >>> >>>> On 9/28/22 05:53, Barry Song wrote: >>>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@xxxxxxxxxx> wrote: >>>>>> On 2022/9/27 14:16, Anshuman Khandual wrote: >>>>>>> [...] >>>>>>> >>>>>>> On 9/21/22 14:13, Yicong Yang wrote: >>>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) >>>>>>>> +{ >>>>>>>> + /* for small systems with small number of CPUs, TLB shootdown is cheap */ >>>>>>>> + if (num_online_cpus() <= 4) >>>>>>> It would be great to have some more inputs from others, whether 4 (which should >>>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar) >>>>>>> is optimal for an wide range of arm64 platforms. >>>>>>> >>>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine >>>>> with 5,6,7 >>>>> cores. >>>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need >>>>> this patch. >>>>> >>>>> so it seems safe to have >>>>> if (num_online_cpus() < 8) >>>>> >>>>>> Do you prefer this macro to be static or make it configurable through kconfig then >>>>>> different platforms can make choice based on their own situations? It maybe hard to >>>>>> test on all the arm64 platforms. >>>>> Maybe we can have this default enabled on machines with 8 and more cpus and >>>>> provide a tlbflush_batched = on or off to allow users enable or >>>>> disable it according >>>>> to their hardware and products. Similar example: rodata=on or off. >>>> No, sounds bit excessive. Kernel command line options should not be added >>>> for every possible run time switch options. >>>> >>>>> Hi Anshuman, Will, Catalin, Andrew, >>>>> what do you think about this approach? >>>>> >>>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64: >>>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@xxxxxxxxxxxxxxxxx/ >>>>> >>>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64 >>>>> even by hardware broadcast. >>>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively >>>> with CONFIG_EXPERT and for num_online_cpus() > 8 ? >>> When running the test program in the commit in a VM, I saw benefits from >>> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine, >>> ptep_clear_flush() went from ~1% in the unpatched version to not showing >>> up. >>> >>> Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is >>> there any overhead? I am wondering what are the downsides of enabling >>> the config by default. >> As we are deferring tlb flush, but sometimes while we are modifying the vma >> which are deferred, we need to do a sync by flush_tlb_batched_pending() in >> mprotect() , madvise() to make sure they can see the flushed result. if nobody >> is doing mprotect(), madvise() etc in the deferred period, the overhead is zero. > > Right, it is difficult to justify this overhead for smaller systems, > which for sure would not benefit from this batched TLB framework. Thank you for the pointers to the overhead. Having looked at this more closely, I also see that flush_tlb_batched_pending() discards the entire mm vs just flushing the page being unmapped (as is done with ptep_clear_flush()).