Hi Andrea, Thanks for having a go at this. On Mon, Feb 03, 2020 at 03:17:45PM -0500, Andrea Arcangeli wrote: > With multiple NUMA nodes and multiple sockets, the tlbi broadcast > shall be delivered through the interconnects in turn increasing the > interconnect traffic and the latency of the tlbi broadcast instruction. There are better ways to handle this in hardware (snoop filters), so hopefully those vendors will eventually learn. > After the local TLB flush this however means the ASID context goes out > of sync in all CPUs except the local one. This can be tracked in the > mm_cpumask(mm): if the bit is set it means the asid context is stale > for that CPU. This results in an extra local ASID TLB flush only if a > single threaded process is migrated to a different CPU and only after a > TLB flush. No extra local TLB flush is needed for the common case of > single threaded processes context scheduling within the same CPU and for > multithreaded processes. Relying om mm_users is not sufficient AFAICT. Let's say on CPU0 you have a kernel thread running with the previous user pgd and ASID set in ttbr0_el1. The mm_users would still be 1 since only mm_count is incremented in context_switch(). If the user thread now runs on CPU1, a local tlbi would only invalidate the TLBs on CPU1. However, CPU0 may still walk (speculatively) the user page tables. An example where this matters is a group of small pages converted to a huge page. If CPU0 already has some TLB entries for small pages in the group but, not being aware of a TLBI for the ptes in the range, may read a block pmd entry (huge page) and we end up with a TLB conflict on CPU0 (CPU1 is fine since you do the local tlbi). There are other examples where this could go wrong as the hardware may keep intermediate pgtable entries in a walk cache. In the arm64 kernel we rely on something the architecture calls break-before-make for any page table updates and these need to be broadcast to other CPUs that may potentially have an entry in their TLB. It may be better if you used mm_cpumask to mark wherever an mm ever ran than relying on mm_users. > Skipping the tlbi instruction broadcasting is already implemented in > local_flush_tlb_all(), this patch only extends it to flush_tlb_mm(), > flush_tlb_range() and flush_tlb_page() too. > > Here's the result of 32 CPUs (ARMv8 Ampere) running mprotect at the same > time from 32 single threaded processes before the patch: > > Performance counter stats for './loop' (3 runs): > > 0 dummy > > 2.121353 +- 0.000387 seconds time elapsed ( +- 0.02% ) > > and with the patch applied: > > Performance counter stats for './loop' (3 runs): > > 0 dummy > > 0.1197750 +- 0.0000827 seconds time elapsed ( +- 0.07% ) That's a pretty artificial test and it is indeed improved by this patch. However, it would be nice to have some real-world scenarios where this matters. -- Catalin