> On Feb 2, 2021, at 12:52 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
> 
> 
>> 
>>> On Feb 1, 2021, at 4:14 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>> 
>>> 
>>>> On Feb 1, 2021, at 2:04 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>> 
>>> Andy’s comments managed to make me realize this code is wrong. We must
>>> call inc_mm_tlb_gen(mm) every time.
>>> 
>>> Otherwise, consider a CPU that saw the old tlb_gen and updated it in its
>>> local cpu_tlbstate on a context-switch. If the process was not running
>>> when the TLB flush was issued, no IPI will be sent to that CPU.
>>> Therefore, a later switch_mm_irqs_off() back to the process will not
>>> flush the local TLB.
>>> 
>>> I need to think if there is a better solution. Multiple calls to
>>> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
>>> instead of one that is specific to the ranges, once the flush actually
>>> takes place. On x86 it’s practically a non-issue, since anyhow any
>>> update of more than 33 entries or so would cause a full TLB flush, but
>>> this is still ugly.
>> 
>> What if we had a per-mm ring buffer of flushes? When starting a flush, we
>> would stick the range in the ring buffer and, when flushing, we would read
>> the ring buffer to catch up. This would mostly replace the flush_tlb_info
>> struct, and it would let us process multiple partial flushes together.
> 
> I wanted to sleep on it, and went back and forth on whether it is the right
> direction, hence the late response.
> 
> I think that what you say makes sense. I think I even tried to do something
> similar once for some reason, but my memory plays tricks on me.
> 
> So tell me what you think of this ring-based solution. As you said, you
> keep a per-mm ring of flush_tlb_info. When you queue an entry, you do
> something like:
> 
> #define RING_ENTRY_INVALID (0)
> 
> gen = inc_mm_tlb_gen(mm);
> struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];
> spin_lock(&mm->ring_lock);

Once you are holding the lock, you should presumably check that the ring
didn’t overflow while you were getting the lock -- if new_tlb_gen > gen,
abort.

> WRITE_ONCE(info->new_tlb_gen, RING_ENTRY_INVALID);
> smp_wmb();
> info->start = start;
> info->end = end;
> info->stride_shift = stride_shift;
> info->freed_tables = freed_tables;
> smp_store_release(&info->new_tlb_gen, gen);
> spin_unlock(&mm->ring_lock);
> 

Seems reasonable. I’m curious how this ends up getting used.

> When you flush you use the entry generation as a sequence lock. On overflow
> of the ring (i.e., sequence number mismatch) you perform a full flush:
> 
> for (gen = mm->tlb_gen_completed; gen < mm->tlb_gen; gen++) {
>         struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];
> 
>         // detect overflow and invalid entries
>         if (smp_load_acquire(&info->new_tlb_gen) != gen)
>                 goto full_flush;
> 
>         start = min(start, info->start);
>         end = max(end, info->end);
>         stride_shift = min(stride_shift, info->stride_shift);
>         freed_tables |= info->freed_tables;
>         smp_rmb();
> 
>         // seqlock-like check that the information was not updated
>         if (READ_ONCE(info->new_tlb_gen) != gen)
>                 goto full_flush;
> }
> 
> On x86 I suspect that performing a full TLB flush would anyhow be the best
> thing to do if there is more than a single entry. I am also not sure that
> it makes sense to check the ring from flush_tlb_func_common() (i.e., in
> each IPI handler), as it might cause cache thrashing.
> 
> Instead it may be better to do so from flush_tlb_mm_range(), when the
> flushes are initiated, and to use an aggregated flush_tlb_info for the
> flush.
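
Doing the aggregation once in flush_tlb_mm_range() sounds right to me. To
make sure we are picturing the same thing, here is a rough sketch of how I
read the per-mm state -- RING_SIZE, the struct name, and the field layout
are placeholders, and where it would hang off mm_struct (or mm->context) is
left open:

#include <linux/types.h>        /* u64 */
#include <linux/spinlock.h>     /* spinlock_t */
#include <asm/tlbflush.h>       /* struct flush_tlb_info */

#define RING_SIZE 8     /* placeholder; a power of two keeps gen % RING_SIZE cheap */

struct tlb_flush_ring {
        spinlock_t              lock;                   /* serializes enqueuers */
        u64                     tlb_gen_completed;      /* highest gen already flushed */
        struct flush_tlb_info   ring[RING_SIZE];        /* slot gen % RING_SIZE holds gen */
};

mm->tlb_gen itself would stay the existing counter that inc_mm_tlb_gen()
bumps; the ring only adds the per-generation range information plus the
completed counter that your loop starts from.
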
> 
> It may also be better to have the ring arch-independent, so it would more
> closely resemble mmu_gather (the parts about the TLB flush information,
> without the freed-pages stuff).
> 
> We can detect deferred TLB flushes either by storing “deferred_gen” in the
> page-tables/VMA (as I did) or by going over the ring, from tlb_gen_completed
> to tlb_gen, and checking for an overlap. I think page-tables would be the
> most efficient/scalable option, but perhaps going over the ring would make
> the logic easier to understand.
> 
> Makes sense? Thoughts?
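
For the ring-walk variant of the deferred-flush check, here is roughly what
I would picture -- again just a sketch: mm_has_deferred_flush() is a made-up
name, it reuses the ring and tlb_gen fields from the sketches above, and it
falls back to "assume it overlaps" whenever an entry has been recycled:

/*
 * Sketch only: assumes mm->ring[], mm->tlb_gen and mm->tlb_gen_completed as
 * above, and that <asm/tlbflush.h> and <asm/barrier.h> are visible here.
 */
static bool mm_has_deferred_flush(struct mm_struct *mm,
                                  unsigned long start, unsigned long end)
{
        u64 gen;

        for (gen = mm->tlb_gen_completed; gen < mm->tlb_gen; gen++) {
                struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];
                unsigned long e_start, e_end;

                /* Entry not yet published or already recycled: be conservative. */
                if (smp_load_acquire(&info->new_tlb_gen) != gen)
                        return true;

                e_start = info->start;
                e_end = info->end;

                /* seqlock-style recheck in case the writer recycled the slot. */
                smp_rmb();
                if (READ_ONCE(info->new_tlb_gen) != gen)
                        return true;

                /* Does [start, end) overlap this deferred flush? */
                if (e_start < end && start < e_end)
                        return true;
        }

        return false;
}

Whether walking the ring like this is cheap enough on the paths that need
the check, versus storing a deferred generation in the page tables/VMA as
you did, is exactly the scalability tradeoff you mention.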