On Thu, Feb 20, 2014 at 9:49 AM, Torvald Riegel <triegel@xxxxxxxxxx> wrote:
>
> Yes, mo_consume is more tricky than mo_acquire.
>
> However, that has an advantage because you can avoid getting stronger
> barriers if you don't need them (ie, you can avoid the "auto-update to
> acquire" you seem to have in mind).

Oh, I agree about that part - I very much understand the reason for
"consume", and I can see how it is more relaxed than "acquire" under
many circumstances.

I just think that you actually *do* want to have "consume" even for
flag values, exactly *because* it is potentially cheaper than acquire.

In fact, I'd argue that making consume reliable in the face of control
dependencies is actually a *good* thing. It may not matter for
something like x86, where consume and acquire end up with the same
simple load, but even there it might relax instruction scheduling a
bit, since a 'consume' would have a barrier just to the *users* of the
value loaded, while 'acquire' would still have a scheduling barrier to
any subsequent operations.

So I claim that for a sequence like my example, where the reader
basically does something like

    load_atomic(&initialized, consume) ? value : -1;

the "consume" version can actually generate better code than "acquire"
- if "consume" is specified the way *I* specified it.

The way the C standard specifies it, the above code is *buggy*. Agreed?
It's really really subtly buggy, and I think that bug is not only a
real danger, I think it is logically hard to understand why. The bug
only makes sense to people who understand how memory ordering and
branch prediction interact.

The way *I* suggested "consume" be implemented, the above not only
works and is sensible, it actually generates possibly better code than
forcing the programmer to use the (illogical) "acquire" operation.

Why?
Let me give you another - completely realistic, even if obviously a
bit made up - example:

    int return_expensive_system_value(void)
    {
        static atomic_t initialized;
        static int calculated;

        if (atomic_read(&initialized, mo_consume))
            return calculated;

        // let's say that this code opens /proc/cpuinfo and counts
        // number of CPUs or whatever ...
        calculated = read_value_from_system_files();
        atomic_write(&initialized, 1, mo_release);
        return calculated;
    }

and let's all agree that this is a somewhat realistic example, and we
can imagine why/how somebody would write code like this. It's
basically a very common lazy initialization pattern, you'll find this
in libraries, in kernels, in application code yadda yadda. No
argument?

Now, let's assume that it turns out that this value ends up being
really performance-critical, so the programmer makes the fast-path an
inline function, tells the compiler that "initialized" read is likely,
and generally wants the compiler to optimize it to hell and back.
Still sounds reasonable and realistic?

In other words, the *expected* code sequence for this is (on x86,
which doesn't need any barriers):

    cmpl $0, initialized
    je unlikely_out_of_line_case
    movl calculated, %eax

and on ARM/power you'd see a 'sync' instruction or whatever.

So far 'acquire' and 'consume' have exactly the same code generation
on power or x86, so your argument can be: "Ok, so let's just use the
inconvenient and hard-to-understand 'consume' semantics that the
current standard has, and tell the programmer that he should use
'acquire' and not worry his little head about the difference because
he will never understand it anyway".

Sure, that would be an inconvenience for programmers, but hey, they're
programming in C or C++, so they are *expected* to be manly men or
womanly women, and a little illogical inconvenience never hurt
anybody. After all, compared to the aliasing rules, that "use acquire,
not consume" rule is positively *simple*, no?

Are we all in agreement so far?
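
[For reference, the lazy-initialization pattern above can be written as
compilable C11. This is only a sketch under stated assumptions, not code
from this thread: it uses <stdatomic.h> in place of the
atomic_read/atomic_write pseudo-API, the read_value_from_system_files()
body and the slow_path_calls counter are hypothetical stand-ins, and the
fast path uses memory_order_acquire, which is the ordering the standard
as currently worded requires when the dependency is a control
dependency.]

```c
#include <stdatomic.h>

/* Hypothetical instrumentation: counts how often the slow path ran. */
static int slow_path_calls;

/* Hypothetical stand-in for the expensive probe (e.g. parsing
 * /proc/cpuinfo); returns a fixed placeholder value. */
static int read_value_from_system_files(void)
{
	slow_path_calls++;
	return 4;
}

int return_expensive_system_value(void)
{
	static atomic_int initialized;	/* static storage: starts at 0 */
	static int calculated;

	/* Fast path: the acquire load orders the later plain load of
	 * 'calculated' after the load of 'initialized'. */
	if (atomic_load_explicit(&initialized, memory_order_acquire))
		return calculated;

	/* Slow path: the release store publishes 'calculated'.
	 * As in the original pattern, nothing stops two first-time
	 * callers from running this concurrently; strictly, two plain
	 * stores to 'calculated' at once are a data race in C11, so
	 * real code would serialize this part (e.g. with call_once). */
	calculated = read_value_from_system_files();
	atomic_store_explicit(&initialized, 1, memory_order_release);
	return calculated;
}
```

[Calling return_expensive_system_value() twice from one thread runs the
probe once; the second call takes the fast path and returns the cached
value.]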

But no, the "consume" thing can actually generate better code.
Trivial example:

    int my_thread_value;
    extern int magic_system_multiplier;

    my_thread_value = return_expensive_system_value();
    my_thread_value *= magic_system_multiplier;

and in the "acquire" model, the "acquire" itself means that the load
from magic_system_multiplier is now constrained by the acquire memory
ordering on "initialized".

While in my *sane* model, where you can consume things even if they
then result in control dependencies, there will still eventually be a
"sync" instruction on powerpc (because you really need one between the
load of 'initialized' and the load of 'calculated'), but the compiler
would be free to schedule the load of 'magic_system_multiplier'
earlier.

So as far as I can tell, we want the 'consume' memory ordering to
honor *all* dependencies, because
 - it's simpler
 - it's more logical
 - it's less error-prone
 - and it allows better code generation

Hmm?

                 Linus