On Thu, Oct 03, 2013 at 04:28:27PM -0700, Josh Triplett wrote:
> On Thu, Oct 03, 2013 at 01:52:45PM -0700, Linus Torvalds wrote:
> > On Thu, Oct 3, 2013 at 1:41 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > The problem is this:
> > > 	A = 1, B = 1
> > > 	CPU1:
> > > 	A = 0
> > > 	<full barrier>
> > > 	synchronize_rcu()
> > > 	read B
> > >
> > > 	CPU2:
> > > 	rcu_read_lock()
> > > 	B = 0
> > > 	read A

/me scratches his head...

OK, for CPU2 to see 1 from its read from A, the corresponding RCU
read-side critical section must have started before CPU1 did A=0.  This
means that this same RCU read-side critical section must have started
before CPU1's synchronize_rcu(), which means that it must complete
before that synchronize_rcu() returns.  Therefore, CPU2's B=0 must
execute before CPU1's read of B, hence that read of B must return zero.

Conversely, if CPU1's read from B returns 1, we know that CPU2's RCU
read-side critical section must not have completed until after CPU1's
synchronize_rcu() returned, which means that the RCU read-side critical
section must have started after that synchronize_rcu() started, so
CPU1's assignment to A must also have already happened.  Therefore,
CPU2's read from A must return zero.

> > > Are we guaranteed that we won't get both of them seeing ones, in situation
> > > when that rcu_read_lock() comes too late to be noticed by synchronize_rcu()?

Yes, at least one of CPU1's and CPU2's reads must return zero.

For whatever it is worth, it is easier to talk about if you write it
this way (easier for me, anyway):

	A = 1, B = 1

	CPU1:
	A = 0
	<full barrier>
	synchronize_rcu()
	r1 = B

	CPU2:
	rcu_read_lock()
	B = 0
	r2 = A

Then we are guaranteed that r1==0||r2==0.

> > Yeah, I think we should be guaranteed that, because the
> > synchronize_rcu() will guarantee that all other CPU's go through an
> > idle period.
> > So the "read A" on CPU2 cannot possibly see a 1 _unless_
> > it happens so early that synchronize_rcu() definitely sees it (ie it's
> > a "preexisting reader" by definition), in which case synchronize_rcu()
> > will be waiting for a subsequent idle period, in which case the B=0 on
> > CPU2 is not only guaranteed to happen but also be visible out, so the
> > "read B" on CPU1 will see 0.  And that's true even if CPU2 doesn't have
> > an explicit memory barrier, because the "RCU idle" state implies that
> > it has gone through a barrier.

I agree that the "<full barrier>" has no effect in this case.

The memory-barrier guarantees of RCU's grace-period primitives are
documented in the excessively long header comment for
synchronize_sched(), which records an extended LKML conversation
between Oleg Nesterov and myself.  Here is a summary, which applies
on systems with more than one CPU:

o	When synchronize_sched() returns, each CPU is guaranteed to
	have executed a full memory barrier since the end of its last
	RCU-sched read-side critical section whose beginning preceded
	the call to synchronize_sched().

o	Each CPU having an RCU read-side critical section that extends
	beyond the return from synchronize_sched() is guaranteed to
	have executed a full memory barrier after the beginning of
	synchronize_sched() and before the beginning of that RCU
	read-side critical section.

o	If CPU A invoked synchronize_sched(), which returned to its
	caller on CPU B, then both CPU A and CPU B are guaranteed to
	have executed a full memory barrier during the execution of
	synchronize_sched().  This also applies when CPUs A and B are
	the same CPU.

These guarantees apply to -all- CPUs, even those that are currently
offline or idle.

Analogous guarantees of course apply to all flavors of RCU.

> I think the reasoning in one direction is actually quite a bit less
> obvious than that.
>
> rcu_read_unlock() does *not* necessarily imply a memory barrier (so the
> B=0 can actually move logically outside the rcu_read_unlock()), but
> synchronize_rcu() *does* imply (and enforce) that a memory barrier has
> occurred on all CPUs as part of quiescence.  However, likewise,
> rcu_read_lock() doesn't imply anything in particular about writes; it
> does enforce either that reads can't leak earlier or that if they do a
> synchronize_rcu() will still wait for them, but I don't think the safety
> interaction between a *write* in the RCU reader and a *read* in the RCU
> writer necessarily follows from that enforcement.
>
> (Also, to the best of my knowledge, you don't even need a barrier on
> CPU1; synchronize_rcu() should imply one.)
>
> If synchronize_rcu() on CPU1 sees rcu_read_lock() on CPU2, then
> synchronize_rcu() will wait for CPU2's read-side critical section and a
> memory barrier before reading B, so CPU1 will see B==0.
>
> The harder direction: If synchronize_rcu() on CPU1 does not see
> rcu_read_lock() on CPU2, then it won't necessarily wait for anything,
> and since rcu_read_lock() itself does not imply any CPU write barriers,
> it's not at all obvious that rcu_read_lock() prevents B=0 from occurring
> before CPU1's read of B.
>
> In short, the interaction between RCU's ordering guarantees and CPU
> memory barriers in the presence of writes on the read side and reads on
> the write side does not seem sufficiently clear to support the portable
> use of the above pattern without an smp_wmb() on CPU2 between
> rcu_read_lock() and B=0.  I think it might happen to work with the
> current implementations of RCU (with which synchronize_rcu() won't
> actually notice a quiescent state and return until after either the
> rcu_read_unlock() or a preemption point), but by the strict semantic
> guarantees of the RCU primitives I think you could write a legitimate
> RCU implementation that would break the above code.
>
> That said, I believe this pattern *will* work with every existing
> implementation of RCU.  Thus, I'd suggest documenting it as a warning to
> prospective RCU optimizers to avoid breaking the above pattern.

I believe that the comment headers for synchronize_sched() and call_rcu()
do document the required restrictions.  Could you please take a look at
them and let me know if I left some wiggle room that needs to be taken
care of?

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html