On Oct 27, 2022, at 1:31 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> So there are two levels of tlb flush optimizations
>
>  (a) avoiding them entirely in the first place
>
>  (b) the whole "once you have to flush, keep track of lazy modes and
> TLB generations, and flush ranges"
>
> And honestly, I think you ignored (a), and that's where we do exactly
> those kinds of "this case doesn't need to flush AT ALL" things.

I did try to avoid TLB flushes by introducing pte_needs_flush() and
avoiding flushes based on the architectural PTE changes. There are even
more x86 arch-based opportunities to further avoid TLB flushes (and then
only flush the TLB if a spurious #PF occurs).

Personally, I still think that making decisions on flushes based
(mostly) only on the arch state makes the code more robust against
misuse (e.g., see the various confusions between mmu_gather’s fullmm and
need_flush_all). Having said that, I will follow your feedback that the
extra complexity is worth the extra performance.

Anyhow, admittedly, I need to give it more thought. For instance, with
respect to the code that you mentioned (in zap_pte_range), after reading
it again, it seems strange: how is it ok to defer the TLB flush until
after the rmap was removed, even if it is done while the PTL is held?
folio_clear_dirty_for_io() would not sync on the PTL afterwards, so the
page might later be re-dirtied using a stale cached PTE.

Oh well.