How to guarantee memory access orthogonality without atomics?

Leonard Plentz <veiokej@xxxxxxxxx> · Fri, 27 Apr 2012 15:49:59 +0700

Dear GCC experts,

I'm trying to find the answer to this question about the way that C11
(or for that matter, C99) does (not) guarantee the
thread-orthogonality of various memory accesses in a cross-platform
manner.

Suppose I have integers (in the abstract sense, not necessarily
(int)s) X and Y. They are both aligned on P-bit boundaries, such that
X appears exactly Q bits before Y. (P and Q are generally powers of 2
and equal, but not necessarily.)

How do I find the minimum values of P and Q, in a cross-platform
manner, such that accesses to X and Y will never affect one another
due to a race?
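
The layout described above can be sketched in C11 roughly as follows.
This is only an illustration: the value P = 64 bits and the names
P_BYTES, struct pair and q_bits() are assumptions made up for the
example, not anything mandated by the standard. It shows how P and Q
could be expressed and measured. It does not show that any particular
value is safe.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>   /* uint32_t */

/* Assumption for illustration only: P is 64 bits, i.e. 8 bytes
   on a platform where CHAR_BIT is 8. */
#define P_BYTES 8

struct pair {
    _Alignas(P_BYTES) uint32_t x;  /* X, aligned on a P-bit boundary */
    _Alignas(P_BYTES) uint32_t y;  /* Y, aligned on a P-bit boundary */
};

/* Checked at build time: Y starts on a multiple of P_BYTES. */
static_assert(offsetof(struct pair, y) % P_BYTES == 0,
              "Y is not aligned on a P boundary");

/* Q: distance in bits from the start of X to the start of Y. */
static size_t q_bits(void)
{
    return (offsetof(struct pair, y) - offsetof(struct pair, x)) * CHAR_BIT;
}
```

With these assumed values, q_bits() reports the same number of bits as
P, which is the "P and Q equal" case mentioned above.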

It may seem obvious that if X and Y are anything at least as large as
a byte, then this will simply never happen. But I'm not so sure
because:

-- It's conceivable that someone could design a very efficient chip in
which it was assumed that no variables belonging to different threads
could ever share a cache line, and presumably C99/C11 could be
implemented on such a chip. While I could access individual bytes, I
could not access the same cache line in different threads without
creating a race. So I'd somehow need to preempt this possibility by
ensuring sufficient distance between variables belonging to different
threads. (Or maybe one just couldn't implement C on such a chip, so
it's a moot issue.)

-- I know I can use all the atomic data types in C11. I don't want to
do that, because it explicitly forces single-threadedness. What I'm
trying to accomplish here is guaranteed thread orthogonality by virtue
of sufficiently large (a) memory alignment and (b) memory access
granularity.

-- Empirically, using ordinary (non-packed) (struct)s doesn't
accomplish the goal because the padding granularity is below the
maximum CPU word size.

-- Empirically, (malloc)ing orthogonal memory regions for different
threads does in fact accomplish the goal, at high expense due to
allocation overhead and pointer dereferencing, as opposed to
straightforward array indexing.

-- I'm particularly worried that memmove(), memcpy(), or some other
builtin will try to be efficient and move large chunks of data at a
time, without realizing that the source or destination alignment is
unfriendly to such chunk size. This could result in read-modify-write
conflicts with neighboring thread domains.

-- Assuming that P and Q can be found, then it looks like a major pain
to try to enforce this at build time, i.e. where's the
pad-struct-to-multiple-of-Q macro?
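
For what it's worth, a padding macro of that kind might look something
like the C11 sketch below. Everything named here (SLOT_BYTES,
PADDED_SLOT, struct counters, slot_for) is hypothetical, and the value
64 is only a placeholder for Q; choosing the right value is exactly
the open question. The sketch relies on the fact that sizeof a struct
is always a multiple of its alignment, so raising the alignment with
_Alignas also pads the size.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */

/* Assumption: 64 bytes stands in for Q; the real value is
   platform-specific and not known from the standard. */
#define SLOT_BYTES 64

/* Hypothetical helper: wrap a payload type in a struct aligned to
   SLOT_BYTES, and verify at build time that the size came out as a
   multiple of SLOT_BYTES. */
#define PADDED_SLOT(name, payload_type)                     \
    struct name {                                           \
        _Alignas(SLOT_BYTES) payload_type payload;          \
    };                                                      \
    static_assert(sizeof(struct name) % SLOT_BYTES == 0,    \
                  "slot is not padded to SLOT_BYTES")

/* Example per-thread data (made up for illustration). */
struct counters {
    long hits;
    long misses;
};

PADDED_SLOT(thread_slot, struct counters);

#define NUM_THREADS 4

/* One padded slot per thread, reachable by plain array indexing. */
static struct thread_slot slots[NUM_THREADS];

/* Return the private data for the thread with the given index. */
static struct counters *slot_for(size_t thread_index)
{
    return &slots[thread_index].payload;
}
```

Each thread would then touch only slot_for(its_index), so neighbouring
slots start at least SLOT_BYTES apart without a separate allocation
per thread. Whether that distance is actually sufficient on a given
platform is not something this sketch can establish.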

I suspect that there's a rule somewhere that makes all this simple and
clear. I've tried searching, but all I find are comments about atomic
data types and not doing stupid stuff like assuming timing
consistency. Please chime in if you can offer insight (or even if you
know that I'm sunk because there is no such guarantee and it's all
hardware-dependent).

Veiokej