On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote:
> On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> > On 02/06/14 18:25, David Howells wrote:
> > >
> > > Is it worth considering a move towards using C11 atomics and barriers and
> > > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do
> > > these.
> >
> >
> > It sounds interesting to me, if we can make it work properly and
> > reliably. + gcc@xxxxxxxxxxx for others in the GCC community to chip in.
>
> Given my (albeit limited) experience playing with the C11 spec and GCC, I
> really think this is a bad idea for the kernel.

I'm not going to comment on what's best for the kernel (simply because I
don't work on it), but I disagree with several of your statements.

> It seems that nobody really
> agrees on exactly how the C11 atomics map to real architectural
> instructions on anything but the trivial architectures.

There are certainly different ways to implement the memory model, and
those have to be specified elsewhere, but I don't see how this differs
much from other things specified in the ABI(s) for each architecture.

> For example, should
> the following code fire the assert?

I don't see how your example (which is about what the language requires
or not) relates to the statement about the mapping above.

>
> extern atomic<int> foo, bar, baz;
>
> void thread1(void)
> {
> 	foo.store(42, memory_order_relaxed);
> 	bar.fetch_add(1, memory_order_seq_cst);
> 	baz.store(42, memory_order_relaxed);
> }
>
> void thread2(void)
> {
> 	while (baz.load(memory_order_seq_cst) != 42) {
> 		/* do nothing */
> 	}
>
> 	assert(foo.load(memory_order_seq_cst) == 42);
> }
>

It's a good example.  My first gut feeling was that the assertion should
never fire, but that was wrong because (as I seem to usually forget) the
seq-cst total order is just a constraint and doesn't itself contribute
to synchronizes-with -- but this is different for seq-cst fences.

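To make the point about seq-cst fences concrete, here is a minimal sketch of
a fence variant of the quoted example.  It is an illustration under stated
assumptions, not code from the thread: the fence placement (right after the
store to foo), the run_once() harness, and thread2() returning the observed
value instead of asserting are all choices made for this sketch.  A seq-cst
fence also acts as a release fence, so once the acquire load of baz reads the
value stored after the fence, the fence synchronizes-with that load and the
store to foo happens-before the load of foo.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> foo{0}, bar{0}, baz{0};

void thread1() {
    foo.store(42, std::memory_order_relaxed);
    // Assumed placement: the fence is sequenced after the store to foo
    // and before the relaxed store to baz.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    bar.fetch_add(1, std::memory_order_seq_cst);
    baz.store(42, std::memory_order_relaxed);
}

// Returns the value of foo observed after baz == 42 has been seen.
int thread2() {
    while (baz.load(std::memory_order_seq_cst) != 42) {
        std::this_thread::yield();  // be polite while spinning
    }
    return foo.load(std::memory_order_seq_cst);
}

// Hypothetical harness: reset the variables, run both threads once,
// and report what thread2 observed for foo.
int run_once() {
    foo.store(0);
    bar.store(0);
    baz.store(0);
    int seen = -1;
    std::thread t2([&seen] { seen = thread2(); });
    std::thread t1(thread1);
    t1.join();
    t2.join();
    return seen;
}
```

With the fence in place, run_once() should only ever return 42.  Without the
fence, the language would allow it to return 0, although a given compiler and
machine may never actually show that outcome.
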
> To answer that question, you need to go and look at the definitions of
> synchronises-with, happens-before, dependency_ordered_before and a whole
> pile of vaguely written waffle to realise that you don't know.

Are you familiar with the formalization of the C11/C++11 model by Batty
et al.?
http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
http://www.cl.cam.ac.uk/~mjb220/n3132.pdf

They also have a nice tool that can run condensed examples and show you
all allowed (and forbidden) executions (it runs in the browser, so is
slow for larger examples), including nice annotated graphs for those:
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

It requires somewhat special syntax, but the following, which should be
equivalent to your example above, runs just fine:

int main() {
  atomic_int foo = 0;
  atomic_int bar = 0;
  atomic_int baz = 0;
  {{{ {
        foo.store(42, memory_order_relaxed);
        bar.store(1, memory_order_seq_cst);
        baz.store(42, memory_order_relaxed);
      }
  ||| {
        r1=baz.load(memory_order_seq_cst).readsvalue(42);
        r2=foo.load(memory_order_seq_cst).readsvalue(0);
      }
  }}};
  return 0; }

That yields 3 consistent executions for me, and likewise if the last
readsvalue() is using 42 as argument.

If you add a "fence(memory_order_seq_cst);" after the store to foo, the
program can't observe != 42 for foo anymore, because the seq-cst fence
is adding a synchronizes-with edge via the baz reads-from.

I think this is a really neat tool, and very helpful to answer such
questions as in your example.

> Certainly,
> the code that arm64 GCC currently spits out would allow the assertion to fire
> on some microarchitectures.
>
> There are also so many ways to blow your head off it's untrue. For example,
> cmpxchg takes a separate memory model parameter for failure and success, but
> then there are restrictions on the sets you can use for each.

That's in there for the architectures without a single-instruction
CAS/cmpxchg, I believe.

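For reference, a minimal sketch of what the two-order cmpxchg interface
looks like in C++11.  The helper name increment_if_below() and the choice of
acq_rel/relaxed are assumptions made for illustration only.  The
restrictions mentioned in the quote are that the failure order must not be
memory_order_release or memory_order_acq_rel (a failed CAS performs only a
load), and, before C++17, that it must not be stronger than the success
order.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically increment `v` as long as it is still
// below `limit`.  Returns true if this call performed the increment.
bool increment_if_below(std::atomic<int>& v, int limit) {
    int expected = v.load(std::memory_order_relaxed);
    while (expected < limit) {
        // Success order: acq_rel (the read-modify-write takes effect).
        // Failure order: relaxed (only a load happened; we just retry).
        if (v.compare_exchange_weak(expected, expected + 1,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
            return true;
        }
        // On failure, `expected` now holds the value currently in `v`,
        // so the loop condition is re-checked against fresh data.
    }
    return false;
}

// Small single-threaded usage example.
void example_usage() {
    std::atomic<int> counter{0};
    assert(increment_if_below(counter, 1));   // 0 -> 1
    assert(!increment_if_below(counter, 1));  // already at the limit
}
```

The weak form may fail spuriously, which is why it sits in a loop; that loop
is also the natural shape on architectures that build CAS from
load-linked/store-conditional pairs.
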
> It's not hard
> to find well-known memory-ordering experts shouting "Just use
> memory_model_seq_cst for everything, it's too hard otherwise".

Everyone I've heard saying this meant it as advice to people who are new
to synchronization or deal with it only infrequently.  The advice is the
simple and safe fallback, and I don't think it's meant as an
acknowledgment that the model itself would be too hard.  If the
language's memory model is supposed to represent weak HW memory models
to at least some extent, there's only so much you can do in terms of
keeping it simple.  If all architectures had x86-like models, the
language's model would certainly be simpler... :)

> Then there's
> the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> atm and optimises all of the data dependencies away)

AFAIK consume memory order was added to model Power/ARM-specific
behavior.  I agree that the way the standard specifies how dependencies
are to be preserved is kind of vague (as far as I understand it).  See
GCC PR 59448.

> as well as the definition
> of "data races", which seem to be used as an excuse to miscompile a program
> at the earliest opportunity.

No.  The purpose of this is to *not disallow* every optimization on
non-synchronizing code.  Because programs are assumed to be
data-race-free, the compiler can treat code as sequential when no
atomics are involved (and thus keep applying optimizations for
sequential code).

Or is there something particular that you dislike about the
specification of data races?

> Trying to introduce system concepts (writes to devices, interrupts,
> non-coherent agents) into this mess is going to be an uphill battle IMHO.

That might very well be true.

OTOH, if you would need to model this uniformly across different
architectures (i.e., so that there is an intra-kernel-portable
abstraction for those system concepts), you might as well try doing this
by extending the C11/C++11 model.
Maybe that will not be successful, or will not really be a good fit, but
at least then it's clear why that's the case.

> I'd
> just rather stick to the semantics we have and the asm volatile barriers.
>
> That's not to say I don't there's no room for improvement in what we have
> in the kernel. Certainly, I'd welcome allowing more relaxed operations on
> architectures that support them, but it needs to be something that at least
> the different architecture maintainers can understand how to implement
> efficiently behind an uncomplicated interface. I don't think that interface is
> C11.

IMHO, one thing worth considering is that for C/C++, the C11/C++11 model
is the only memory model that has widespread support.  So, even though
it's a fairly weak memory model (unless you go for the "only seq-cst"
beginners' advice) and thus comes with a higher complexity, this model is
likely what most people will be familiar with over time.  Deviating from
the "standard" model can have valid reasons, but it also has a cost in
that new contributors are more likely to be familiar with the "standard"
model.

Note that I won't claim that the C11/C++11 model is perfect -- there are
a few rough edges there (e.g., the forward progress guarantees are (or
used to be) a little coarse for my taste), and consume vs. dependencies
worries me as well.  But, IMHO, overall it's the best C/C++ language
model we have.

Torvald