> On Jan 30, 2021, at 4:11 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>
> From: Nadav Amit <namit@xxxxxxxxxx>
>
> Currently, deferred TLB flushes are detected in the mm granularity: if
> there is any deferred TLB flush in the entire address space due to NUMA
> migration, pte_accessible() in x86 would return true, and
> ptep_clear_flush() would require a TLB flush. This would happen even if
> the PTE resides in a completely different vma.

[ snip ]

> +static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
> +{
> +	struct mm_struct *mm = tlb->mm;
> +	u64 mm_gen;
> +
> +	/*
> +	 * Any change of PTE before calling __track_deferred_tlb_flush() must
> +	 * be performed using an RMW atomic operation that provides a memory
> +	 * barrier, such as ptep_modify_prot_start(). The barrier ensures the
> +	 * PTEs are written before the current generation is read,
> +	 * synchronizing (implicitly) with flush_tlb_mm_range().
> +	 */
> +	smp_mb__after_atomic();
> +
> +	mm_gen = atomic64_read(&mm->tlb_gen);
> +
> +	/*
> +	 * This condition checks both for the first deferred TLB flush and
> +	 * for other pending or executed TLB flushes after the last table
> +	 * that we updated. In the latter case, we are going to skip a
> +	 * generation, which would lead to a full TLB flush. This should
> +	 * therefore not cause correctness issues, and should not induce
> +	 * overheads, since anyhow in TLB storms it is better to perform a
> +	 * full TLB flush.
> +	 */
> +	if (mm_gen != tlb->defer_gen) {
> +		VM_BUG_ON(mm_gen < tlb->defer_gen);
> +
> +		tlb->defer_gen = inc_mm_tlb_gen(mm);
> +	}
> +}

Andy's comments made me realize this code is wrong. We must call
inc_mm_tlb_gen(mm) every time.

Otherwise, consider a CPU that saw the old tlb_gen and updated it in its
local cpu_tlbstate on a context switch. If the process was not running
when the TLB flush was issued, no IPI will be sent to that CPU.
Therefore, a later switch_mm_irqs_off() back into the process will not
flush the local TLB.

I need to think whether there is a better solution. Multiple calls to
inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
instead of one that is specific to the ranges, once the flush actually
takes place. On x86 this is practically a non-issue, since any update of
more than 33 entries or so would cause a full TLB flush anyway, but it
is still ugly.
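
For reference, the "increment every time" variant I have in mind would be
roughly the following. This is an untested sketch, not a finished fix: it
keeps tlb->defer_gen and the ordering requirement from the patch above,
and uses the existing inc_mm_tlb_gen() and smp_mb__after_atomic()
helpers. It also has the full-flush downside described above.

static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
{
	struct mm_struct *mm = tlb->mm;

	/*
	 * As in the patch above: PTE changes must be ordered before the
	 * generation is advanced. Possibly redundant here, since
	 * inc_mm_tlb_gen() is a value-returning atomic and therefore
	 * already implies a full barrier, but kept for clarity.
	 */
	smp_mb__after_atomic();

	/*
	 * Unconditionally advance the generation. A CPU that already
	 * copied the previous tlb_gen into its cpu_tlbstate on a context
	 * switch will then observe a newer generation and flush on the
	 * next switch_mm_irqs_off() into this mm, even though no IPI was
	 * sent to it while the process was not running.
	 */
	tlb->defer_gen = inc_mm_tlb_gen(mm);
}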