On Thu, Nov 21, 2013 at 04:09:49PM -0800, Linus Torvalds wrote:
> On Thu, Nov 21, 2013 at 2:52 PM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > Actually, the weakest forms of locking only guarantee a consistent view
> > of memory if you are actually holding the lock.  Not "a" lock, but "the"
> > lock.
>
> I don't think we necessarily support any architecture that does that,
> though. And afaik, it's almost impossible to actually do sanely in
> hardware with any sane cache coherency, so..

It is not the architecture that matters here, it is just a definition
of what ordering guarantees the locking primitives provide, independent
of the architecture.  In the Linux kernel, we just happen to have picked
a strongly ordered definition of locking.  And we have probably made the
right choice.  But there really are environments where unlock+lock does
-not- provide a full memory barrier.

If you never refer to any lock-protected memory without holding the
lock, this difference doesn't matter to you -- there will be nothing you
can do to detect the difference under this restriction.  Of course, if
you want to scale, you -will- find yourself accessing lock-protected
memory without holding locks, so in the Linux kernel, this difference
in unlock+lock ordering semantics really does matter.

> So realistically, I think we only really need to worry about memory
> ordering that is tied to cache coherency protocols, where even locking
> rules tend to be about memory ordering (although extended rules like
> acquire/release rather than the broken pure barrier model).
>
> Do you know any actual architecture where this isn't the case?

You can implement something that acts like a lock (just not like a
-Linux- -kernel- lock) but where unlock+lock does not provide a full
barrier on architectures that provide weak memory barriers.  And there
are software environments that provide these weaker locks.  Which does
-not- necessarily mean that the Linux kernel should do this, of course!

???

OK, part of the problem is that this discussion has spanned at least
three different threads over the past week or so.  I will try to
summarize.  Others can correct me if I blow it.

a.	We found out that some of the circular-buffer code is unsafe.
	I first insisted on inserting a full memory barrier, but it was
	eventually determined that weaker memory barriers could suffice.
	This was the thread where I proposed the smp_tmb(), which name
	you (quite rightly) objected to.  This proposal eventually
	morphed into smp_load_acquire() and smp_store_release().

b.	For circular buffers, you need really minimal ordering semantics
	out of smp_load_acquire() and smp_store_release().  In
	particular, on powerpc, the lwsync instruction suffices.  Peter
	therefore came up with an implementation matching those weak
	semantics.

c.	In a couple of threads involving MCS lock, I complained about
	insufficient barriers, again suggesting adding smp_mb().
	(Almost seems like a theme here...)  Someone noted that ARM64's
	shiny new load-acquire and store-release instructions sufficed
	(which does appear to be true).  Tim therefore came up with an
	implementation based on Peter's smp_load_acquire() and
	smp_store_release().  The smp_store_release() is used when the
	lock holder hands off to the next in the queue, and the
	smp_load_acquire() is used when the next in the queue notices
	that the lock has been handed off.

	So we really are talking about the unlock+lock case!!!
d.	Unfortunately (or fortunately, depending on your viewpoint),
	locking as defined by the Linux kernel requires stronger
	smp_load_acquire() and smp_store_release() semantics than are
	required by the circular buffer.  In particular, the weaker
	smp_load_acquire() and smp_store_release() semantics provided
	by the original powerpc implementation do not provide a full
	memory barrier for the unlock+lock handoff on the queue.
	I (finally) noticed this and complained.

e.	The question then was "how to fix this?"  There are a number of
	ways, namely these guys from two emails ago plus one more:

> > So the three fixes I know of at the moment are:
> >
> > 1.	Upgrade smp_store_release()'s PPC implementation from lwsync
> >	to sync.

We are going down this path.  I produced what I believe to be a valid
proof that the x86 versions do provide a full barrier for unlock+lock,
which Peter will check tomorrow (Friday) morning.  Itanium is trivial
(famous last words), ARM is also trivial (again, famous last words),
and if x86 works, then so does s390.  And so on.

My alleged proof for x86 is here, should anyone else like to take a
crack at it: http://www.spinics.net/lists/linux-mm/msg65462.html

> >	What about ARM?  ARM platforms that have the load-acquire and
> >	store-release instructions could use them, but other ARM
> >	platforms have to use dmb.  ARM avoids PPC's lwsync issue
> >	because it has no equivalent to lwsync.
> >
> > 2.	Place an explicit smp_mb() into the MCS-lock queued handoff
> >	code.

This would make unlock+lock a full memory barrier, but would allow the
weaker and cheaper semantics for smp_load_acquire() and
smp_store_release().

> > 3.	Remove the requirement that "unlock+lock" be a full memory
> >	barrier.

This would allow cheaper locking primitives on some architectures, but
would require more care when making unlocked accesses to variables
protected by locks.

4.	Have two parallel APIs, smp_store_release_weak(),
	smp_store_release(), and so on.  My reaction to this is "just
	say no".  It is not like we exactly have a shortage of
	memory-barrier APIs at the moment.

> > We have been leaning towards #1, but before making any hard decision
> > on this we are looking more closely at what the situation is on other
> > architectures.
>
> So I might be inclined to lean towards #1 simply because of test coverage.

Another reason is "Who knows what code in the Linux kernel might be
relying on unlock+lock providing a full barrier?"

> We have no sane test coverage of weakly ordered models. Sure, ARM may
> be weakly ordered (with saner acquire/release in ARM64), but
> realistically, no existing ARM platforms actually gives us any
> reasonable test *coverage* for things like this, despite having tons
> of chips out there running Linux. Very few people debug problems in
> that world. The PPC people probably have much better testing and are
> more likely to figure out the bugs, but don't have the pure number of
> machines. So x86 tends to still remain the main platform where serious
> testing gets done.

We clearly need something like rcutorture, but for locking.  No two
ways about it.  But we need that regardless of whether or not we change
the ordering semantics of unlock+lock.
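To be concrete about what "relying on unlock+lock providing a full
barrier" means, here is an untested litmus-test-style sketch.  The
variables x and y, the registers r0 and r1, and the two locks are all
invented for illustration; this is not any particular piece of kernel
code:

	CPU 0				CPU 1
	-----				-----
	ACCESS_ONCE(x) = 1;		ACCESS_ONCE(y) = 1;
	spin_unlock(&lock_a);		smp_mb();
	spin_lock(&lock_b);		r1 = ACCESS_ONCE(x);
	r0 = ACCESS_ONCE(y);

If CPU 0's unlock+lock sequence acts as a full memory barrier, then
(given CPU 1's smp_mb()) the outcome r0 == 0 && r1 == 0 is forbidden.
If the unlock and lock provide only release and acquire semantics (as
with the lwsync-based powerpc implementation), nothing orders CPU 0's
store to x against its later load from y, so both r0 and r1 can be
zero.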
> That said, I'd still be perfectly happy with #3, since - unlike, say,
> the PCI ordering issues with drivers - at least people *can* try to
> think about this somewhat analytically, even if it's ripe for
> confusion and subtle mistakes.  And I still think you got the ordering
> wrong, and should be talking about "lock+unlock" rather than
> "unlock+lock".

No, I really am talking about unlock+lock.  The MCS queue handoff is an
unlock followed by a lock, and that is what got weakened to no longer
imply a full memory barrier on all architectures when the MCS patch
started using smp_load_acquire() and smp_store_release().

							Thanx, Paul
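PS.  For anyone who has not been following the MCS-lock threads, here
is a greatly simplified and untested sketch of the handoff in question.
The structure and function names are invented for illustration and do
not match the actual patch:

	struct mcs_spinlock {
		struct mcs_spinlock *next;
		int locked;		/* nonzero once handed the lock */
	};

	/* Unlock side: the current holder hands off to the next waiter.
	 * The release orders the holder's critical section before the
	 * handoff store. */
	static void mcs_handoff(struct mcs_spinlock *next)
	{
		smp_store_release(&next->locked, 1);
	}

	/* Lock side: the next waiter spins until handed the lock.  The
	 * acquire orders the handoff load before the waiter's critical
	 * section. */
	static void mcs_wait_for_handoff(struct mcs_spinlock *node)
	{
		while (!smp_load_acquire(&node->locked))
			cpu_relax();
	}

The smp_store_release() is the tail end of the previous holder's
unlock, and the smp_load_acquire() is the front end of the next
holder's lock, which is why this really is the unlock+lock case.
Release+acquire orders the two critical sections with respect to each
other, but with the lwsync-based powerpc implementation it does not
give other CPUs the full-barrier guarantee discussed above -- hence
fixes #1 and #2.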