Hi Catalin,

On Mon, Mar 09, 2020 at 11:22:42AM +0000, Catalin Marinas wrote:
> IIUC, nr_active_mm keeps track of how many instances of the current pgd
> (TTBR0_EL1) are active.

Correct.

> And this code here can assume that if nr_active_mm <= 1, no broadcast is
> necessary.

Yes.

> One concern I have is the ordering between TTBR0_EL1 update in
> cpu_do_switch_mm() and the nr_active_mm, both on a different CPU. We
> only have an ISB for context synchronisation on that CPU but I don't
> think the architecture guarantees any relation between sysreg access and
> the memory update. We have a DSB but that's further down in switch_to().

There are several cpu_do_switch_mm() calls updating TTBR0_EL1, and
several nr_active_mm updates, that can happen on different CPUs
simultaneously, so it's hard to see exactly which pair you refer to.
Overall, the idea here is that even if a speculative TLB lookup happens
in between those updates, while the "mm" is going away and
atomic_dec(&mm->nr_active_mm) is being called on the mm, it doesn't
matter: no userland software can use such a stale TLB entry before
local_flush_tlb_asid() gets rid of it.

The v1 patch (before I posted the incremental mm_count check) had
issues with speculatively loaded stale TLB entries only because they
weren't guaranteed to be flushed when the kernel thread switched back
to a userland process. So it relied on the CPU not speculatively
loading random pagetables while the kernel thread was running in lazy
TLB mode; here the flush is guaranteed, so in turn the CPU is free to
load any TLB entry it wants at any given time.

> However, what worries me more is that you can now potentially do a TLB
> shootdown without clearing the intermediate (e.g. VA to pte) walk caches
> from the TLB. Even if the corresponding pgd and ASID are no longer
> active on other CPUs, I'm not sure it's entirely safe to free (and
> re-allocate) pages belonging to a pgtable without first flushing the

With regard to not doing a tlbi broadcast, nothing fundamentally
changed between v1 (single threaded, using mm_users) and the latest
version (multithreaded, introducing nr_active_mm). v1 only skipped the
tlbi broadcast on remote CPUs that had run the ASID of a
single-threaded process before a CPU migration, but the pages could
already be reallocated from the point of view of those remote CPUs. In
the current version the page can be reallocated even from the point of
view of the current CPU. However, the fact that the window has been
enlarged significantly should be a good thing: if there were something
wrong with it, it would have been far easier to reproduce. This concern
is still a corollary of the previous paragraph: it's still about stale
TLB entries being left in an ASID that can't ever be used through the
current ASID.

> TLB. All the architecture spec states is that the software must first
> clear the entry followed by TLBI (the break-before-make rules).

The "break" in "break-before-make" is still guaranteed, or it wouldn't
boot; it's just not implemented with the tlbi broadcast anymore. The
break is implemented by enforcing that no stale TLB entry of such an
ASID exists in the TLB while any userland code runs.

The x86 specs assumed an OS would allocate a TSS per process and do
context switches by using a task gate. I recall the first Linux version
I used had a TSS per process, as envisioned by the specs. Later the TSS
became per-CPU, the esp0 pointer was updated instead (native_load_sp0),
and the stack was switched by hand. I guess reading the specs may be
confusing after such a software change, but that doesn't mean the
software shouldn't optimize things beyond the specs if it's safe to do
and not explicitly forbidden.

The page being reused for another virtual address on another CPU isn't
necessarily an invalid scenario from the point of view of the CPU; it
only looks invalid if you assume the page is freed. If you think of it
like a MAP_SHARED page that gets one more mapping associated with it
(the reuse of the page) from another CPU, it may look more legit. The
fact that there's an old mapping left in the MAP_SHARED pagecache
indefinitely doesn't mean the CPU with such an old mapping left in the
TLB is allowed to change the content of the page, if the software never
writes to that virtual address through the old mapping. The important
thing is that the content of the page must not change unless the
software running on the CPU explicitly writes through the virtual
address that corresponds to the stale TLB entry (and it's guaranteed
the software won't write to it). The stale TLB of such an ASID is
eventually flushed either through a bump of the ASID generation or
through a local ASID flush.
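As a rough sketch of the idea (this is not the actual patch code:
mm_tlbi_is_local() and local_flush_tlb_page() are made-up names here,
standing for "at most this CPU has the ASID active" and for a local,
non-broadcast per-ASID invalidate; only nr_active_mm and
local_flush_tlb_asid() match the naming discussed above), the flush
side boils down to:

    #include <linux/atomic.h>
    #include <linux/mm_types.h>
    #include <asm/tlbflush.h>

    /*
     * Illustrative only: if at most one CPU (the local one) has this
     * mm's pgd/ASID loaded in TTBR0_EL1, a local per-ASID invalidate
     * is enough and the inner-shareable tlbi broadcast can be skipped.
     * Stale entries possibly left on other CPUs under this ASID cannot
     * be used by userland before local_flush_tlb_asid() (or an ASID
     * generation bump) gets rid of them.
     */
    static inline bool mm_tlbi_is_local(struct mm_struct *mm)
    {
            return atomic_read(&mm->nr_active_mm) <= 1;
    }

    static void sketch_flush_tlb_page(struct vm_area_struct *vma,
                                      unsigned long addr)
    {
            if (mm_tlbi_is_local(vma->vm_mm))
                    local_flush_tlb_page(vma, addr); /* no broadcast */
            else
                    flush_tlb_page(vma, addr);       /* tlbi ... is */
    }

The point is that the decision is purely local: nothing remote is
consulted, only the counter maintained at context switch time.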
> That said, the benchmark numbers are not very encouraging. Around 1%
> improvement in a single run, it can as well be noise. Also something
> like hackbench may also show a slight impact on the context switch path.

I recall I tested hackbench and it appeared faster with processes;
otherwise it was within measurement error. hackbench with processes is
fork-heavy, so it gets some benefit, because all those copy-on-write
faults post fork would otherwise trigger tlbi broadcasts on all CPUs to
flush the wrprotected TLB entry. Specifically, the one converted to a
local TLB flush is ptep_clear_flush_notify in wp_page_copy(), and
there's one for each page being modified by parent or child.

> Maybe with a true NUMA machine with hundreds of CPUs we may see a
> difference, but it depends on how well the TLBI is implemented.

The numbers in the commit header were not from a single run. perf stat
-r 10 -e dummy runs the workload 10 times and shows the stdev along
with the number, so you can see what the noise was. It wasn't only a 1%
improvement either, and overall there's no noise in the measurement.

tcmalloc_large_heap_fragmentation_unittest, simulating dealing with
small objects by many different containers at the same time, was 9.4%
faster (%stdev 0.36% 0.29%), with 32 CPUs and no NUMA.

256 times parallel run of 32 `sort` in a row was 10.3% faster (%stdev
0.77% 0.38%), with 32 CPUs and no NUMA.

The multithreaded microbenchmark runs 16 times faster, but that's not
meaningful by itself; it's still a good hint that some real-life
workloads (especially those with frequent MADV_DONTNEED) will run
faster, and they did (and a verification that multithreaded apps can
now also be optimized). Rafael already posted a benchmark specifically
stressing the context switch.

It's reasonable to expect any multi-socket NUMA machine to show more
benefit from the optimization than the 32 CPU non-NUMA machine used for
the current benchmarks.

Thanks,
Andrea