> On Feb 1, 2021, at 4:14 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote: > > >> On Feb 1, 2021, at 2:04 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote: >> >> Andy’s comments managed to make me realize this code is wrong. We must >> call inc_mm_tlb_gen(mm) every time. >> >> Otherwise, a CPU that saw the old tlb_gen and updated it in its local >> cpu_tlbstate on a context-switch. If the process was not running when the >> TLB flush was issued, no IPI will be sent to the CPU. Therefore, later >> switch_mm_irqs_off() back to the process will not flush the local TLB. >> >> I need to think if there is a better solution. Multiple calls to >> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush >> instead of one that is specific to the ranges, once the flush actually takes >> place. On x86 it’s practically a non-issue, since anyhow any update of more >> than 33-entries or so would cause a full TLB flush, but this is still ugly. > > What if we had a per-mm ring buffer of flushes? When starting a flush, we would stick the range in the ring buffer and, when flushing, we would read the ring buffer to catch up. This would mostly replace the flush_tlb_info struct, and it would let us process multiple partial flushes together. I wanted to sleep on it, and went back and forth on whether it is the right direction, hence the late response. I think that what you say make sense. I think that I even tried to do once something similar for some reason, but my memory plays tricks on me. So tell me what you think on this ring-based solution. As you said, you keep per-mm ring of flush_tlb_info. When you queue an entry, you do something like: #define RING_ENTRY_INVALID (0) gen = inc_mm_tlb_gen(mm); struct flush_tlb_info *info = mm->ring[gen % RING_SIZE]; spin_lock(&mm->ring_lock); WRITE_ONCE(info->new_tlb_gen, RING_ENTRY_INVALID); smp_wmb(); info->start = start; info->end = end; info->stride_shift = stride_shift; info->freed_tables = freed_tables; smp_store_release(&info->new_tlb_gen, gen); spin_unlock(&mm->ring_lock); When you flush you use the entry generation as a sequence lock. On overflow of the ring (i.e., sequence number mismatch) you perform a full flush: for (gen = mm->tlb_gen_completed; gen < mm->tlb_gen; gen++) { struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE]; // detect overflow and invalid entries if (smp_load_acquire(info->new_tlb_gen) != gen) goto full_flush; start = min(start, info->start); end = max(end, info->end); stride_shift = min(stride_shift, info->stride_shift); freed_tables |= info.freed_tables; smp_rmb(); // seqlock-like check that the information was not updated if (READ_ONCE(info->new_tlb_gen) != gen) goto full_flush; } On x86 I suspect that performing a full TLB flush would anyhow be the best thing to do if there is more than a single entry. I am also not sure that it makes sense to check the ring from flush_tlb_func_common() (i.e., in each IPI handler) as it might cause cache thrashing. Instead it may be better to do so from flush_tlb_mm_range(), when the flushes are initiated, and use an aggregated flush_tlb_info for the flush. It may also be better to have the ring arch-independent, so it would resemble more of mmu_gather (the parts about the TLB flush information, without the freed pages stuff). We can detect deferred TLB flushes either by storing “deferred_gen” in the page-tables/VMA (as I did) or by going over the ring, from tlb_gen_completed to tlb_gen, and checking for an overlap. I think page-tables would be most efficient/scalable, but perhaps going over the ring would be easier to understand logic. Makes sense? Thoughts?