Excerpts from Peter Zijlstra's message of December 3, 2020 6:44 pm:
> On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:
>
>> power: same as ARM, except that the loop may be rather larger since
>> the systems are bigger. But I imagine it's still faster than Nick's
>> approach -- a cmpxchg to a remote cacheline should still be faster than
>> an IPI shootdown.
>
> While a single atomic might be cheaper than an IPI, the comparison
> doesn't work out nicely. You do the xchg() on every unlazy, while the
> IPI would be once per process exit.
>
> So over the life of the process, it might do very many unlazies, adding
> up to a total cost far in excess of what the single IPI would've been.

Yeah, this is the concern. I looked at things that add cost to the
idle switch code, and it gets hard to justify the scalability improvement
when you slow these fundamental things down even a bit.

I still think working on the assumption that IPIs = scary expensive
might not be correct. An IPI itself is, but you only issue them when
you've left a lazy mm on another CPU, which just isn't that often.

Thanks,
Nick
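
To make the trade-off above concrete, here is a rough back-of-the-envelope model, not kernel code. It assumes the atomic scheme pays one remote-cacheline atomic per unlazy, and the IPI scheme pays one IPI per CPU that still holds the mm lazily at process exit. The struct, the function names, and any cycle counts fed into it are hypothetical placeholders, not measurements.

```c
/* Hypothetical per-operation costs in cycles; NOT measured values. */
struct cost_model {
	unsigned long atomic_cycles;	/* assumed cost of one remote xchg/cmpxchg */
	unsigned long ipi_cycles;	/* assumed cost of one IPI round trip */
};

/* Lifetime cost if every unlazy pays one atomic. */
static inline unsigned long long
atomic_scheme_cost(const struct cost_model *m, unsigned long long unlazies)
{
	return unlazies * m->atomic_cycles;
}

/* Exit-time cost if one IPI is sent per CPU still holding the mm lazily. */
static inline unsigned long long
ipi_scheme_cost(const struct cost_model *m, unsigned long long lazy_cpus_at_exit)
{
	return lazy_cpus_at_exit * m->ipi_cycles;
}

/*
 * Smallest number of unlazies over the process lifetime at which the
 * atomic scheme becomes strictly more expensive than the IPI scheme.
 */
static inline unsigned long long
break_even_unlazies(const struct cost_model *m, unsigned long long lazy_cpus_at_exit)
{
	return ipi_scheme_cost(m, lazy_cpus_at_exit) / m->atomic_cycles + 1;
}
```

With made-up costs of 100 cycles per atomic and 5000 cycles per IPI, and one lazy CPU at exit, the atomic scheme is already the more expensive one after 51 unlazies. If no CPU holds the mm lazily at exit, the IPI scheme costs nothing in this model.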