On Tue, Jan 05, 2021 at 09:22:51PM +0000, Nadav Amit wrote:
> It is also about performance due to unwarranted TLB flushes.

If there is a problem, switching to the wait_flush_pending() model
suggested by Peter may not even require changes to the common code in
memory.c, since I'm thinking it may not need to take a failure path if
we plug it in at the same place as the TLB flush. So instead of the
flush we could always block there until we read zero in the atomic,
then smp_rmb() and we're ready to start the copy (sketch [1] at the
end of this mail).

So we either flush with an IPI if we didn't read zero, or we block
until we read zero; the difference is hidden from do_wp_page. All
do_wp_page cares about is that by the time the abstract call returns,
there's no stale TLB left for such a pte. Whether that is achieved by
blocking and waiting or by flushing the TLB doesn't matter too much.

As for how badly the IPI will perform: with the improved arm64 TLB
flushing code in production, we keep track of how many simultaneous mm
contexts there are, specifically to skip the SMP-unscalable TLBI
broadcast on arm64, just like we already avoid IPIs for lazy TLBs on
x86 (see tlb_is_not_lazy in native_flush_tlb_others, sketch [2]
below). In other words the IPI will materialize only if there's more
than one thread running while clear_refs runs. Lazy TLBs won't get
IPIs on either x86 upstream or arm64 enterprise. This won't help
multithreaded processes that compute from all CPUs at all times, but
even multiple vcpu threads aren't guaranteed to be running at all
times.

My main concern would be an IPI flood that slows down clear_refs and
UFFDIO_WRITEPROTECT themselves, but an incremental optimization (not
required for correctness) is to have UFFDIO_WRITEPROTECT and
clear_refs switch to lazy TLB mode before they call
inc_tlb_flush_pending() and unlazy only after dec_tlb_flush_pending()
(sketch [3] below). So it's possible to at least guarantee the IPIs
won't slow them down.

> In addition, as I stated before, having some clean interfaces that tell
> whether a TLB flush is needed or not would be helpful and simpler to follow.
> For instance, we can have is_pte_prot_demotion(oldprot, newprot) to figure
> out whether a TLB flush is needed in change_pte_range() and avoid
> unnecessary flushes when unprotecting pages with either mprotect() or
> userfaultfd.

When you mentioned this earlier I was thinking what happens then with
flush_tlb_fix_spurious_fault(). The fact it's safe doesn't guarantee
it's a performance win if there's a stream of spurious faults as a
result. So it'd need to be checked, especially in the case of
mprotect, where the flush can be deferred and coalesced into a single
IPI at the end, so there's not much to gain from skipping it anyway
(sketch [4] below). If you can guarantee there won't be a flood of
spurious wrprotect faults, then it'll be a nice optimization.

Thanks,
Andrea
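
[1] A minimal sketch of the wait model, assuming mm->tlb_flush_pending
is the atomic_t maintained by inc_tlb_flush_pending() and
dec_tlb_flush_pending(); the helper name is hypothetical and the spin
is only illustrative (a real version would presumably have to sleep
rather than burn the CPU):

	/*
	 * Sketch only: block until all deferred TLB flushes pending
	 * on this mm have completed, instead of issuing our own
	 * flush from the do_wp_page() path.
	 */
	static void wait_flush_pending(struct mm_struct *mm)
	{
		while (atomic_read(&mm->tlb_flush_pending))
			cpu_relax();
		/*
		 * Only after observing the count drop to zero is it
		 * safe to start the copy; the smp_rmb() keeps the
		 * copy in wp_page_copy() from being reordered before
		 * the read of the pending count.
		 */
		smp_rmb();
	}

do_wp_page would call this in the same place it would otherwise
flush, and then proceed with the copy unconditionally.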
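
[2] For reference, the x86 lazy-TLB filtering I'm referring to looks
roughly like this in arch/x86/mm/tlb.c (simplified; the exact form
varies between kernel versions):

	/*
	 * A CPU in lazy TLB mode will flush or switch away from the
	 * stale mm before it returns to userspace, so it can be
	 * skipped entirely when sending the flush IPIs.
	 */
	static bool tlb_is_not_lazy(int cpu, void *data)
	{
		return !per_cpu(cpu_tlbstate.is_lazy, cpu);
	}

native_flush_tlb_others() passes that as the condition callback when
building the mask of CPUs to IPI, which is why lazy CPUs never see
the interrupt. The arm64 enterprise code achieves the equivalent by
counting how many CPUs are running the mm context before deciding
between local invalidation and the TLBI broadcast.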
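
[3] The incremental optimization in pseudocode; how exactly the
caller enters and leaves lazy TLB mode from process context is
hand-waved here (those two helpers are hypothetical), the point is
only the ordering around the pending counter:

	static void wrprotect_with_deferred_flush(struct mm_struct *mm)
	{
		enter_lazy_tlb_mode();		/* hypothetical */
		inc_tlb_flush_pending(mm);
		/*
		 * Wrprotect the ptes and send the flush IPIs: the
		 * IPIs skip lazy CPUs, including the one we run on.
		 */
		dec_tlb_flush_pending(mm);
		leave_lazy_tlb_mode();		/* hypothetical */
	}

That way clear_refs and UFFDIO_WRITEPROTECT can never themselves be a
target of the IPI flood they generate.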
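
[4] My reading of the interface you're proposing, as a hypothetical
x86-flavored sketch (the _PAGE_RW bit is only illustrative for the
wrprotect case this thread is about; a real version would have to be
arch-agnostic):

	/*
	 * Hypothetical helper: a flush is only mandatory when the
	 * new protection takes write permission away. A pure
	 * permission promotion (unprotect) could skip the flush and
	 * let flush_tlb_fix_spurious_fault() mop up any stale
	 * read-only TLB entry that still faults on write.
	 */
	static inline bool is_pte_prot_demotion(pgprot_t oldprot,
						pgprot_t newprot)
	{
		return !!(pgprot_val(oldprot) & ~pgprot_val(newprot) &
			  _PAGE_RW);
	}

change_pte_range() could then skip the deferred flush whenever
is_pte_prot_demotion() returns false, and those skipped flushes are
exactly where the stream of spurious wrprotect faults mentioned above
would come from.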