Dear GCC experts,

I'm trying to find the answer to a question about the way that C11 (or, for that matter, C99) does (not) guarantee the thread-orthogonality of various memory accesses in a cross-platform manner.

Suppose I have integers (in the abstract sense, not necessarily (int)s) X and Y. They are both aligned on P-bit boundaries, such that X appears exactly Q bits before Y. (P and Q are generally powers of 2 and equal, but not necessarily.) How do I find the minimum values of P and Q, in a cross-platform manner, such that accesses to X and Y from different threads will never affect one another due to a race?

It may seem obvious that if X and Y are each at least as large as a byte, then this will simply never happen. But I'm not so sure, because:

-- It's conceivable that someone could design a very efficient chip on the assumption that no variables belonging to different threads ever share a cache line, and presumably C99/C11 could be implemented on such a chip. While I could access individual bytes, I could not access the same cache line from different threads without creating a race. So I'd somehow need to preempt this possibility by ensuring sufficient distance between variables belonging to different threads. (Or maybe one just couldn't implement C on such a chip, so it's a moot issue.)
-- I know I can use all the atomic data types in C11. I don't want to do that, because it explicitly forces single-threadedness. What I'm trying to accomplish here is guaranteed thread orthogonality by virtue of sufficiently large (a) memory alignment and (b) memory access granularity.
-- Empirically, using unpacked (struct)s doesn't accomplish the goal, because the unpacking granularity is below the maximum CPU word size.
-- Empirically, (malloc)ing orthogonal memory regions for different threads does in fact accomplish the goal, but at high expense due to allocation overhead and pointer dereferencing, as opposed to straightforward array indexing.
-- I'm particularly worried that memmove(), memcpy(), or some other builtin will try to be efficient and move large chunks of data at a time, without realizing that the source or destination alignment is unfriendly to such a chunk size. This could result in read-modify-write conflicts with neighboring thread domains.
-- Assuming that P and Q can be found, it looks like a major pain to enforce this at build time, i.e. where's the pad-struct-to-multiple-of-Q macro?

I suspect that there's a rule somewhere that makes all this simple and clear. I've tried searching, but all I find are comments about atomic data types and about not doing stupid stuff like assuming timing consistency.

Please chime in if you can offer insight (or even if you know that I'm sunk because there is no such guarantee and it's all hardware-dependent).

Veiokej
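
P.S. To make the padding point concrete, here is a minimal sketch of the kind of build-time enforcement I have in mind, using C11's alignas. The names SEPARATION, per_thread_slot, slots, and slot_stride are hypothetical, and the value 64 is only an assumption (a common cache line size), not something I know the standard to promise. Whether this separation is actually required for correctness, or is merely sufficient, is exactly the part I can't determine.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas, alignof (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical separation in bytes. 64 is an assumption, not a guarantee. */
#define SEPARATION 64

/* Raising the alignment of one member raises the alignment of the whole
 * struct. Because sizeof is always a multiple of the alignment, the
 * compiler pads each element out to a multiple of SEPARATION. */
struct per_thread_slot {
    alignas(SEPARATION) uint64_t counter;
    uint8_t flag;
};

/* Fail the build if the layout is not what was intended. */
static_assert(alignof(struct per_thread_slot) == SEPARATION,
              "slot is not aligned to SEPARATION");
static_assert(sizeof(struct per_thread_slot) % SEPARATION == 0,
              "slot size is not a multiple of SEPARATION");

/* One slot per thread, reached by plain array indexing. */
static struct per_thread_slot slots[4];

/* Distance in bytes between the starts of two adjacent slots. */
static ptrdiff_t slot_stride(void)
{
    return (char *)&slots[1] - (char *)&slots[0];
}
```

With this layout, thread i would touch only slots[i], so I keep array indexing and avoid the per-thread malloc. It does nothing about the memcpy()/memmove() worry, though.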