On Fri, Dec 14, 2018 at 04:39:34PM -0500, Alan Stern wrote:
> On Fri, 14 Dec 2018, Paul E. McKenney wrote:
>
> > I would say that sys_membarrier() has zero-sized read-side critical
> > sections, either comprising a single instruction (as is the case for
> > synchronize_sched(), actually), preempt-disable regions of code
> > (which are irrelevant to userspace execution), or the spaces between
> > consecutive pairs of instructions (as is the case for the newer
> > IPI-based implementation).
> >
> > The model picks the single-instruction option, and I haven't yet found
> > a problem with this -- which is no surprise given that, as you say,
> > an actual implementation makes this same choice.
>
> I believe that for RCU tests the LKMM gives the same results for
> length-zero critical sections interspersed between all the instructions
> and length-one critical sections surrounding all instructions (except
> synchronize_rcu). But the proof is tricky and I haven't checked it
> carefully.

That assertion is completely consistent with my implementation
experience, give or take the usual caveats about idle and offline
execution.

> > > > The other thing that took some time to get used to is the possibility
> > > > of long delays during sys_membarrier() execution, allowing significant
> > > > execution and reordering between different CPUs' IPIs. This was key
> > > > to my understanding of the six-process example, and probably needs to
> > > > be clearly called out, including in an example or two.
> > >
> > > In all the examples I'm aware of, no more than one of the IPIs
> > > generated by each sys_membarrier call really matters. (Of course,
> > > there's no way to know in advance which one it will be, so you have to
> > > send an IPI to every CPU.) The execution delays and reordering
> > > between different CPUs' IPIs don't appear to be significant.
> >
> > Well, there are litmus tests that are allowed in which the allowed
> > execution is more easily explained in terms of delays between different
> > CPUs' IPIs, so it seems worth keeping track of.
> >
> > There might be a litmus test that can tell the difference between
> > simultaneous and non-simultaneous IPIs, but I cannot immediately think of
> > one that matters. Might be a failure of imagination on my part, though.
>
> P0        P1        P2
> Wc=1      [mb01]    Rb=1
> memb      Wa=1      Rc=0
> Ra=0      Wb=1      [mb02]
>
> The IPIs have to appear in the positions shown, which means they cannot
> be simultaneous. The test is allowed because P2's reads can be
> reordered.

OK, so "simultaneous" IPIs could be emulated in a real implementation by
having sys_membarrier() send each IPI (but not wait for a response), then
execute a full memory barrier and set a shared variable. Each IPI handler
would spin waiting for the shared variable to be set, then execute a full
memory barrier and atomically increment yet another shared variable and
return from interrupt. When that other shared variable's value reached
the number of IPIs sent, the sys_membarrier() would execute its final
(already existing) full memory barrier and return. Horribly expensive
and definitely not recommended, but eminently doable.

The difference between current sys_membarrier() and the "simultaneous"
variant described above is similar to the difference between
non-multicopy-atomic and multicopy-atomic memory ordering.
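For concreteness, here is a rough sketch of that "simultaneous"
emulation, assuming kernel context and ignoring CPU hotplug races,
caller migration, error handling, and the restriction to CPUs running
the caller's mm. The names memb_gate, memb_count, memb_ipi_handler(),
and simultaneous_membarrier() are invented for illustration; this is a
sketch of the idea, not proposed code:

	#include <linux/atomic.h>
	#include <linux/cpu.h>
	#include <linux/smp.h>

	static atomic_t memb_gate;	/* Zero until all IPIs have been sent. */
	static atomic_t memb_count;	/* Number of handlers that have finished. */

	static void memb_ipi_handler(void *unused)
	{
		while (!atomic_read(&memb_gate))
			cpu_relax();		/* Spin until the gate opens. */
		smp_mb();			/* Order gate load before the increment. */
		atomic_inc(&memb_count);	/* Tell the sender this CPU is done. */
	}

	static void simultaneous_membarrier(void)
	{
		int cpu, nr_ipis = 0;

		atomic_set(&memb_gate, 0);
		atomic_set(&memb_count, 0);
		cpus_read_lock();		/* Keep the set of online CPUs stable. */
		for_each_online_cpu(cpu) {
			if (cpu == raw_smp_processor_id())
				continue;	/* The caller supplies its own barriers. */
			/* Send the IPI but do not wait for the handler (wait == 0). */
			smp_call_function_single(cpu, memb_ipi_handler, NULL, 0);
			nr_ipis++;
		}
		smp_mb();			/* Order IPI sends before opening the gate. */
		atomic_set(&memb_gate, 1);	/* Release all handlers at once. */
		while (atomic_read(&memb_count) < nr_ipis)
			cpu_relax();		/* Wait for every handler's barrier. */
		smp_mb();			/* The final (already existing) barrier. */
		cpus_read_unlock();
	}

The single store to the gate is what makes the IPIs "simultaneous":
no handler's memory barrier can execute until every IPI has already
been sent.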
So, after thinking it through, my guess is that pretty much any litmus
test that can discern between multicopy-atomic and non-multicopy-atomic
should be transformable into something that can distinguish between the
current and the "simultaneous" sys_membarrier() implementation. Seem
reasonable?

Or alternatively, may I please apply your Signed-off-by to your earlier
sys_membarrier() patch so that I can queue it? I will probably also
change smp_memb() to membarrier() or some such. Again, within the Linux
kernel, membarrier() can be emulated with smp_call_function() invoking a
handler that does smp_mb().

							Thanx, Paul
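For reference, a minimal sketch of that smp_call_function()-based
in-kernel emulation; the name kernel_membarrier() is invented for
illustration, and the final argument of 1 asks smp_call_function() to
wait until every handler has run:

	#include <linux/smp.h>

	static void membarrier_ipi(void *unused)
	{
		smp_mb();	/* Full barrier on each of the other online CPUs. */
	}

	static void kernel_membarrier(void)
	{
		smp_mb();	/* Full barrier on the calling CPU. */
		smp_call_function(membarrier_ipi, NULL, 1);	/* wait == 1 */
		smp_mb();	/* Pair with the barriers in the handlers. */
	}

Because smp_call_function() skips the calling CPU, the caller executes
its own smp_mb() before and after sending the IPIs.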