Hello Catalin,

On Mon, Feb 10, 2020 at 05:51:06PM +0000, Catalin Marinas wrote:
> Relying on mm_users is not sufficient AFAICT. Let's say on CPU0 you have
> a kernel thread running with the previous user pgd and ASID set in
> ttbr0_el1. The mm_users would still be 1 since only mm_count is
> incremented in context_switch(). If the user thread now runs on CPU1, a
> local tlbi would only invalidate the TLBs on CPU1. However, CPU0 may
> still walk (speculatively) the user page tables.
>
> An example where this matters is a group of small pages converted to a
> huge page. If CPU0 already has some TLB entries for small pages in the
> group but, not being aware of a TLBI for the ptes in the range, may read
> a block pmd entry (huge page) and we end up with a TLB conflict on CPU0
> (CPU1 is fine since you do the local tlbi).
>
> There are other examples where this could go wrong as the hardware may
> keep intermediate pgtable entries in a walk cache. In the arm64 kernel
> we rely on something the architecture calls break-before-make for any
> page table updates and these need to be broadcast to other CPUs that may
> potentially have an entry in their TLB.
>
> It may be better if you used mm_cpumask to mark wherever an mm ever ran
> than relying on mm_users.

Agreed. If we can use mm_cpumask to track where the mm ever ran, then
if I'm not mistaken we could also optimize multithreaded processes in
the same way: if only one thread is running frequently and the others
are frequently sleeping, we could issue a single tlbi broadcast (modulo
invalidates of small virtual ranges).

In the meantime the below should be enough to address the concern you
raised about the proof of concept RFC patch.

I already experimented with mm_users == 1 earlier and it doesn't change
the benchmark results for the "best case" below.
(untested)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 772bbc45b867..a2d53b301f22 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -169,7 +169,8 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	unsigned long asid = __TLBI_VADDR(0, ASID(mm));
 
 	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
-	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
+	    (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
 		int cpu = get_cpu();
 
 		cpumask_setall(mm_cpumask(mm));
@@ -177,7 +178,9 @@
 
 		smp_mb();
 
-		if (atomic_read(&mm->mm_users) <= 1) {
+		if (atomic_read(&mm->mm_users) <= 1 &&
+		    (system_uses_ttbr0_pan() ||
+		     atomic_read(&mm->mm_count) == 1)) {
 			dsb(nshst);
 			__tlbi(aside1, asid);
 			__tlbi_user(aside1, asid);
@@ -212,7 +215,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
 
 	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
-	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
+	    (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
 		int cpu = get_cpu();
 
 		cpumask_setall(mm_cpumask(mm));
@@ -220,7 +224,9 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 
 		smp_mb();
 
-		if (atomic_read(&mm->mm_users) <= 1) {
+		if (atomic_read(&mm->mm_users) <= 1 &&
+		    (system_uses_ttbr0_pan() ||
+		     atomic_read(&mm->mm_count) == 1)) {
 			dsb(nshst);
 			__tlbi(vale1, addr);
 			__tlbi_user(vale1, addr);
@@ -264,7 +270,8 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 	end = __TLBI_VADDR(end, asid);
 
 	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
-	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
+	    (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
 		int cpu = get_cpu();
 
 		cpumask_setall(mm_cpumask(mm));
@@ -272,7 +279,9 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 
 		smp_mb();
 
-		if (atomic_read(&mm->mm_users) <= 1) {
+		if (atomic_read(&mm->mm_users) <= 1 &&
+		    (system_uses_ttbr0_pan() ||
+		     atomic_read(&mm->mm_count) == 1)) {
 			dsb(nshst);
 			for (addr = start; addr < end; addr += stride) {
 				if (last_level) {

> That's a pretty artificial test and it is indeed improved by this patch.
> However, it would be nice to have some real-world scenarios where this
> matters.

I don't know exactly how much we should rely on the hardware to snoop
the asid on NUMA. To fully optimize this in hardware, each CPU would
need to keep the equivalent of a replicated mm_cpumask bitflag for
every asid, and every CPU would need to tell every other CPU which asid
it is loading each time it loads one, which is exactly what x86 does
with mm_cpumask in software. That is the ideal behaviour, but is it an
architectural requirement that all implementations provide it?

The case I measured has a single socket, so it's even simpler because
it could be optimized entirely in-core. Even with a single socket I'm
not sure what's going wrong in the chip: it felt like the engine that
does the broadcast runs serially system-wide and then all CPUs have to
wait on it.

Still, your question of whether it'll make a difference in practice is
a good one and I don't have a sure answer yet.
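Just to make the mm_cpumask direction more concrete, below is the kind
of thing I have in mind. It's only a sketch, not part of the patch
above: the helper names note_asid_loaded() and
local_or_broadcast_flush_mm() are made up, the placement of the
context-switch hook is hand-waved, and the ordering against a
concurrent switch_mm() as well as ASID rollover are deliberately
ignored here.

/*
 * Sketch only: track in mm_cpumask() every CPU that ever loaded this
 * mm's ASID, and skip the TLBI broadcast only when the ASID has never
 * been loaded anywhere but on the local CPU.
 */
#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/smp.h>
#include <asm/tlbflush.h>	/* ASID(), __TLBI_VADDR(), __tlbi() */

/*
 * Would be called from the context switch path, with preemption
 * disabled, every time this mm's ASID is written into ttbr0_el1.
 */
static inline void note_asid_loaded(struct mm_struct *mm)
{
	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
}

static inline void local_or_broadcast_flush_mm(struct mm_struct *mm)
{
	unsigned long asid = __TLBI_VADDR(0, ASID(mm));
	int cpu = get_cpu();

	if (cpumask_equal(mm_cpumask(mm), cpumask_of(cpu))) {
		/* ASID only ever loaded here: non-shareable invalidate */
		dsb(nshst);
		__tlbi(aside1, asid);
		__tlbi_user(aside1, asid);
		dsb(nsh);
	} else {
		/* ASID was loaded (or may be speculated from) elsewhere */
		dsb(ishst);
		__tlbi(aside1is, asid);
		__tlbi_user(aside1is, asid);
		dsb(ish);
	}
	put_cpu();
}

The property I like is that bits in mm_cpumask() would only ever be
set, so (modulo the ordering glossed over above) the cpumask_equal()
check can only lose the optimization, never correctness, and it no
longer depends on mm_users at all, which is what would let the
single-runner multithreaded case use the local tlbi too.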
I suppose that before doing more benchmarking it's better to make a new
version of this that uses mm_cpumask to track where the asid was ever
loaded, as you suggested, so that it will also optimize away tlbi
broadcasts from multithreaded processes where only one thread is
running frequently?

Thanks!
Andrea