On 6/16/21 12:35 AM, Peter Zijlstra wrote: > On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote: >> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm: >>> membarrier() needs a barrier after any CPU changes mm. There is currently >>> a comment explaining why this barrier probably exists in all cases. This >>> is very fragile -- any change to the relevant parts of the scheduler >>> might get rid of these barriers, and it's not really clear to me that >>> the barrier actually exists in all necessary cases. >> >> The comments and barriers in the mmdrop() hunks? I don't see what is >> fragile or maybe-buggy about this. The barrier definitely exists. >> >> And any change can change anything, that doesn't make it fragile. My >> lazy tlb refcounting change avoids the mmdrop in some cases, but it >> replaces it with smp_mb for example. > > I'm with Nick again, on this. You're adding extra barriers for no > discernible reason, that's not generally encouraged, seeing how extra > barriers is extra slow. > > Both mmdrop() itself, as well as the callsite have comments saying how > membarrier relies on the implied barrier, what's fragile about that? > My real motivation is that mmgrab() and mmdrop() don't actually need to be full barriers. The current implementation has them being full barriers, and the current implementation is quite slow. So let's try that commit message again: membarrier() needs a barrier after any CPU changes mm. There is currently a comment explaining why this barrier probably exists in all cases. The logic is based on ensuring that the barrier exists on every control flow path through the scheduler. It also relies on mmgrab() and mmdrop() being full barriers. mmgrab() and mmdrop() would be better if they were not full barriers. As a trivial optimization, mmgrab() could use a relaxed atomic and mmdrop() could use a release on architectures that have these operations. Larger optimizations are also in the works. Doing any of these optimizations while preserving an unnecessary barrier will complicate the code and penalize non-membarrier-using tasks. Simplify the logic by adding an explicit barrier, and allow architectures to override it as an optimization if they want to. One of the deleted comments in this patch said "It is therefore possible to schedule between user->kernel->user threads without passing through switch_mm()". It is possible to do this without, say, writing to CR3 on x86, but the core scheduler indeed calls switch_mm_irqs_off() to tell the arch code to go back from lazy mode to no-lazy mode.