On Oct 16, 2022, at 2:47 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, Oct 14, 2022 at 8:51 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>> Unless I am missing something, flush_tlb_batched_pending() would be
>> called and do the flushing at this point, no?
>
> Ahh, yes.
>
> That seems to be doing the right thing, although looking a bit more at
> it, I think it might be improved.
>
> At least in the zap_pte_range() case, instead of doing a synchronous
> TLB flush if there are pending batched flushes, it might be better if
> flush_tlb_batched_pending() would set the "need_flush_all" bit in the
> mmu_gather structure.
>
> That would possibly avoid that extra TLB flush entirely - since
> *normally* zap_page_range() will cause a TLB flush anyway.
>
> Maybe it doesn't matter.

It seems possible and simple. But in general, there are still various
unnecessary TLB flushes due to the TLB batching. Specifically,
ptep_clear_flush() might flush unnecessarily when pte_accessible()
finds a non-zero tlb_flush_pending value. Worse, the complexity of the
code is high.

To simplify the TLB flushing mechanism and eliminate the unnecessary
TLB flushes, it is possible to track the “completed” TLB generation
(i.e., the generation that has already been flushed). Tracking pending
TLB flushes can be done at VMA or page-table granularity instead of mm
granularity to avoid unnecessary flushes in ptep_clear_flush(). Andy
also suggested having a queue of the pending TLB flushes.

The main problem is that each of the aforementioned enhancements can
add some cache references, and therefore might induce additional
overheads.

I sent some patches before [1], which I can revive. The main question
is whether we can prioritize simplicity and unification of the various
TLB-flush batching mechanisms over (probably very small) performance
gains.

[1] https://lore.kernel.org/linux-mm/20210131001132.3368247-1-namit@xxxxxxxxxx/
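
For illustration, a rough and untested sketch of the need_flush_all
direction quoted above (not an actual patch; the extra mmu_gather
argument and the two *_pending_batched_flush() helpers are hypothetical
placeholders for the existing bookkeeping):

#include <linux/mm_types.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>

/* Placeholder: is a batched flush still pending for this mm? */
bool pending_batched_flush(struct mm_struct *mm);
/* Placeholder: mark the pending batched flush as handled. */
void clear_pending_batched_flush(struct mm_struct *mm);

/*
 * Hypothetical variant of flush_tlb_batched_pending() that folds pending
 * batched flushes into an ongoing mmu_gather instead of flushing
 * synchronously.
 */
static void flush_tlb_batched_pending_tlb(struct mm_struct *mm,
					  struct mmu_gather *tlb)
{
	if (!pending_batched_flush(mm))
		return;

	if (tlb) {
		/*
		 * zap_pte_range() will flush when the gather finishes; just
		 * make sure that flush covers the whole mm.  (When to clear
		 * the pending marker is deliberately left out here; ordering
		 * it against the deferred flush needs care.)
		 */
		tlb->need_flush_all = 1;
		return;
	}

	/* No gather in progress: fall back to a synchronous flush. */
	flush_tlb_mm(mm);
	clear_pending_batched_flush(mm);
}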
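
To make the “completed generation” idea a bit more concrete, here is a
minimal user-space toy model (not kernel code; all names are invented
for illustration). The point is that a caller only needs to flush if
the generation it depends on has not already been completed by somebody
else:

#include <stdatomic.h>
#include <stdio.h>

struct mock_mm {
	atomic_ulong pending_gen;	/* highest generation with a deferred flush */
	atomic_ulong completed_gen;	/* highest generation known to be flushed */
};

/* A PTE change whose flush is deferred: allocate a new generation. */
static unsigned long defer_flush(struct mock_mm *mm)
{
	return atomic_fetch_add(&mm->pending_gen, 1) + 1;
}

/* Pretend to do the (expensive) flush covering everything up to @gen. */
static void do_flush(struct mock_mm *mm, unsigned long gen)
{
	unsigned long done = atomic_load(&mm->completed_gen);

	printf("flushing up to generation %lu\n", gen);
	/* Advance completed_gen monotonically. */
	while (done < gen &&
	       !atomic_compare_exchange_weak(&mm->completed_gen, &done, gen))
		;
}

/*
 * Analogue of the ptep_clear_flush() decision: with a single mm-wide
 * "pending" indication, every caller flushes as long as anything is
 * pending; with generations, work somebody else already completed is
 * skipped.
 */
static void maybe_flush(struct mock_mm *mm)
{
	unsigned long pending = atomic_load(&mm->pending_gen);

	if (atomic_load(&mm->completed_gen) >= pending) {
		printf("skipping flush: generation %lu already completed\n",
		       pending);
		return;
	}
	do_flush(mm, pending);
}

int main(void)
{
	struct mock_mm mm;

	atomic_init(&mm.pending_gen, 0);
	atomic_init(&mm.completed_gen, 0);

	defer_flush(&mm);	/* somebody defers a flush */
	maybe_flush(&mm);	/* first caller has to flush */
	maybe_flush(&mm);	/* second caller sees it is already done */
	return 0;
}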