On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:

> power: same as ARM, except that the loop may be rather larger since
> the systems are bigger. But I imagine it's still faster than Nick's
> approach -- a cmpxchg to a remote cacheline should still be faster than
> an IPI shootdown.

While a single atomic might be cheaper than an IPI, the comparison
doesn't work out nicely. You do the xchg() on every unlazy, while the
IPI would be once per process exit.

So over the life of the process, it might do very many unlazies, adding
up to a total cost far in excess of what the single IPI would've been.

And while I appreciate all the work to get rid of the active_mm
accounting; the worry I have with pushing this all into arch code is
that it will be so very easy to get this subtly wrong.