> On Feb 1, 2021, at 2:04 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>
>> On Jan 30, 2021, at 4:11 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>
>> From: Nadav Amit <namit@xxxxxxxxxx>
>>
>> Currently, deferred TLB flushes are detected at the mm granularity: if
>> there is any deferred TLB flush in the entire address space due to NUMA
>> migration, pte_accessible() on x86 would return true, and
>> ptep_clear_flush() would require a TLB flush. This would happen even if
>> the PTE resides in a completely different vma.
>
> [ snip ]
>
>> +static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
>> +{
>> +	struct mm_struct *mm = tlb->mm;
>> +	u64 mm_gen;
>> +
>> +	/*
>> +	 * Any change of PTE before calling __track_deferred_tlb_flush()
>> +	 * must be performed using an RMW atomic operation that provides
>> +	 * a memory barrier, such as ptep_modify_prot_start(). The barrier
>> +	 * ensures the PTEs are written before the current generation is
>> +	 * read, synchronizing (implicitly) with flush_tlb_mm_range().
>> +	 */
>> +	smp_mb__after_atomic();
>> +
>> +	mm_gen = atomic64_read(&mm->tlb_gen);
>> +
>> +	/*
>> +	 * This condition checks both for the first deferred TLB flush and
>> +	 * for other TLB flushes, pending or executed, after our last
>> +	 * page-table update. In the latter case we are going to skip a
>> +	 * generation, which would lead to a full TLB flush. This should
>> +	 * therefore not cause correctness issues, and should not induce
>> +	 * overheads, since in TLB storms it is better to perform a full
>> +	 * TLB flush anyhow.
>> +	 */
>> +	if (mm_gen != tlb->defer_gen) {
>> +		VM_BUG_ON(mm_gen < tlb->defer_gen);
>> +
>> +		tlb->defer_gen = inc_mm_tlb_gen(mm);
>> +	}
>> +}
>
> Andy's comments made me realize this code is wrong. We must call
> inc_mm_tlb_gen(mm) every time.
>
> Otherwise, consider a CPU that saw the old tlb_gen and already recorded
> it in its local cpu_tlbstate on a context-switch. If the process was not
> running when the TLB flush was issued, no IPI is sent to that CPU, so a
> later switch_mm_irqs_off() back into the process will see a matching
> generation and not flush the local TLB.
>
> I need to think whether there is a better solution. Multiple calls to
> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush,
> instead of one specific to the ranges, once the flush actually takes
> place. On x86 it is practically a non-issue, since any update of more
> than 33 entries or so causes a full TLB flush anyhow, but this is still
> ugly.

What if we had a per-mm ring buffer of flushes? When starting a flush, we
would stick the range in the ring buffer and, when flushing, we would read
the ring buffer to catch up. This would mostly replace the flush_tlb_info
struct, and it would let us process multiple partial flushes together.
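
Something like the sketch below. All of the names here (struct
tlb_flush_ring, tlb_flush_ring_add(), TLB_FLUSH_RING_SIZE) are made up
for illustration; none of this is existing kernel API:

#define TLB_FLUSH_RING_SIZE	16	/* must be a power of two */

struct tlb_flush_entry {
	unsigned long	start;
	unsigned long	end;
};

struct tlb_flush_ring {
	spinlock_t		lock;	/* serializes producers */
	unsigned long		head;	/* next slot to fill, only grows */
	struct tlb_flush_entry	entries[TLB_FLUSH_RING_SIZE];
};

/*
 * Record a deferred flush of [start, end). head is never wrapped to the
 * ring size, so a reader can tell how many entries it missed by how far
 * head ran ahead of its own tail.
 */
static void tlb_flush_ring_add(struct tlb_flush_ring *ring,
			       unsigned long start, unsigned long end)
{
	struct tlb_flush_entry *e;

	spin_lock(&ring->lock);
	e = &ring->entries[ring->head & (TLB_FLUSH_RING_SIZE - 1)];
	e->start = start;
	e->end = end;
	ring->head++;
	spin_unlock(&ring->lock);
}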
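
On the flush side, each consumer (e.g. the per-CPU state that today holds
its last seen tlb_gen) would keep a tail index and replay whatever it has
not seen yet. When the ring wrapped past the tail, the old ranges are gone
and we fall back to a full flush, which also preserves the current "many
updates => full flush" behavior for free. Again only a sketch:
flush_tlb_one_range() is a hypothetical per-range flush helper, and a real
implementation would presumably want the reader side to be lock-free:

/*
 * Returns true if all deferred partial flushes were replayed, false if
 * entries were overwritten and the caller must do a full flush instead.
 */
static bool tlb_flush_ring_catch_up(struct tlb_flush_ring *ring,
				    unsigned long *tail)
{
	bool replayed = true;

	spin_lock(&ring->lock);
	if (ring->head - *tail > TLB_FLUSH_RING_SIZE) {
		/* the ring wrapped past us; the old ranges are gone */
		replayed = false;
	} else {
		for (; *tail != ring->head; (*tail)++) {
			struct tlb_flush_entry *e =
				&ring->entries[*tail & (TLB_FLUSH_RING_SIZE - 1)];

			flush_tlb_one_range(e->start, e->end);
		}
	}
	*tail = ring->head;
	spin_unlock(&ring->lock);

	return replayed;
}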