Re: GCC in-line assembly and the removal of -mcx16

Toebs Douglass <toby@xxxxxxxxxxxxxx> · Sun, 28 May 2017 13:15:27 +0200

Hej, all.

I would like to write a little about the new libatomic mechanism for
double-word CAS on 64-bit platforms, since I've ended up not using it
and I suspect that was not what the developers would have wished for
their hard work.

Back in early 2005 AMD released some of its early 64-bit processors
lacking support for double-word CAS.

GCC, as I understand it, laudably looks to support a wide range of
platforms, and as such introduced the -mcx16 switch, so the user could
indicate the presence of double-word CAS support on x86_64.

So the way atomics worked was that GCC provided the __sync and then
later the __atomic intrinsic APIs, wholly independently, and always
actually emitted inline instructions.
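
For readers who have not used both APIs, here is a minimal sketch of the
two styles.  It works on a single pointer-sized word, where neither
-mcx16 nor libatomic comes into play; the double-word form has the same
shape but operates on a 16-byte type.  The function names cas_sync and
cas_atomic are made up for illustration.

```c
#include <stdint.h>

/* Older style: __sync builtin, always a full barrier.
   Returns 1 if *p held `expected` and was replaced by `desired`. */
static inline int cas_sync(uintptr_t *p, uintptr_t expected,
                           uintptr_t desired)
{
    return __sync_bool_compare_and_swap(p, expected, desired);
}

/* Newer style: __atomic builtin with explicit memory orders.
   On failure the value actually found in *p is written to *expected. */
static inline int cas_atomic(uintptr_t *p, uintptr_t *expected,
                             uintptr_t desired)
{
    return __atomic_compare_exchange_n(p, expected, desired,
                                       0 /* strong, not weak */,
                                       __ATOMIC_SEQ_CST,
                                       __ATOMIC_SEQ_CST);
}
```

The visible difference is that the __atomic form reports the value it
found on failure, which saves a reload in a retry loop.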

Fast forward to today and the release of 7.1.0, and here we see the
mechanism for supporting these platforms has changed.  The -mcx16
switch is not used in the new mechanism; rather, GCC now depends on
libatomic, and libatomic depends on ifunc support.  GCC calls libatomic
for double-word CAS, and libatomic is designed to select the best
implementation of double-word CAS available on the current platform.

On the face of it, this seems entirely reasonable: it is generic (not
just for x86_64), more flexible and less intrusive (no need for a
special switch), and it provides alternative mechanisms on platforms
lacking double-word CAS.  Code can continue to compile, link and run.

I observe however there are some costs.

1. GCC now depends upon libatomic.

My code base consists of data structures only and it targets a bare C
compiler (not just freestanding - bare).  I think libatomic is probably
like libgcc - it should be considered part of the bare compiler.
However, libgcc I suspect really is available everywhere you get GCC,
whereas I have no idea if this is really the case yet for libatomic, or
if it might be available but only partially implemented.  (The
still-open bug about libatomic not initializing its ifuncs correctly on
static builds is on my mind.)

2. libatomic seems to depend upon ifuncs.

I do not know how widely supported ifuncs are.  If someone takes my code
on their odd little 17 bit dishwasher, with their port of GCC which only
has support for static linking, will they have ifuncs?

3. libatomic silently substitutes alternatives.

For my use case, only a lock-free instruction can be used.  There are no
alternatives.  If a lock-free instruction is not available, the platform
does not in fact support my code.  If the code compiles and links when
it should not, the user is misled.  He may even unwittingly use the
code, given that the test suite will in fact pass and the benchmark is
harder to port and so may not be ported.

It is not clear to me how I can tell if libatomic is using a lock-free
instruction.  I can check perhaps in my own builds, by inspecting the
assembly, but what about end-user builds?
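
One partial answer, offered as a sketch rather than a fix: GCC exposes
two compile-time facts that a build can query.  The builtin
__atomic_always_lock_free(size, 0) says whether the compiler promises
inline lock-free code for every object of that size, and the predefined
macro __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 is defined when the compiler
was told the target has a 16-byte compare-and-swap (on x86_64, when
-mcx16 is in effect).  Neither says what libatomic selects at run time,
so a "no" here does not prove a lock is used; it only means the compiler
itself made no promise.  The helper names and DW_SIZE are hypothetical.

```c
#include <stddef.h>

/* Assumed size of a double word: two pointers
   (16 bytes on x86_64 and aarch64). */
#define DW_SIZE (2 * sizeof(void *))

/* 1 if the compiler guarantees inline lock-free atomics for every
   object of double-word size, 0 if it makes no such promise. */
static inline int dw_always_lock_free(void)
{
    return __atomic_always_lock_free(DW_SIZE, 0) ? 1 : 0;
}

/* Same question for a single word, for comparison. */
static inline int word_always_lock_free(void)
{
    return __atomic_always_lock_free(sizeof(void *), 0) ? 1 : 0;
}

/* 1 if the compiler was told at build time that a 16-byte
   compare-and-swap instruction exists on the target. */
static inline int dw_sync_cas_declared(void)
{
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
    return 1;
#else
    return 0;
#endif
}
```

The run-time counterpart, __atomic_is_lock_free(size, ptr), may itself
be resolved by a call into libatomic, so using it brings the dependency
back.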

4. A library call is now made every time a double-word CAS is used.

This cost is small (one jump instruction I believe, after the initial
lookup work), but it is *directly* opposing a primary design goal
(performance) and so is as costly as it is able to be.  The code base
elsewhere is carefully designed to avoid overheads, the only other
function call being the one into the data structure API itself.

What I would observe then about these costs is that GCC's view is longer
and wider than mine as a *user* of GCC.  It is a sensible cost/benefit
trade-off for GCC, but not for me, as a user of GCC.  I am, by being
wholly unimportant, able to dismiss these early AMD processors, because
so few people use my code that no-one will have them.

As such, I bear only the costs, and none of the benefits (and, indeed,
one of the benefits - alternatives - is a serious cost).  I've moved
from a simple and thoroughly understood situation to a complex and
poorly understood situation.

What I see however is that there is a way for me to avoid these costs
and return to the simple situation.  My code has an abstraction layer,
and I can implement inline assembly for double-word CAS on 64-bit
platforms and use that instead of __atomic and __sync.  This only has to
be done for x86_64 (very simple) and aarch64 (complex, alas, so it
goes), since they are the only platforms to offer this.  It is not
needed for 32-bit platforms since GCC does not use libatomic to support
double-word CAS on these platforms.
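
To make the x86_64 half of that concrete, here is a minimal sketch of
an inline-assembly double-word CAS built on the cmpxchg16b instruction.
It is not the code from the author's library; the type dw_t and the
function dwcas are hypothetical names, and the aarch64 version is not
attempted.  There is deliberately no fallback: on any other target the
build stops.

```c
#include <assert.h>
#include <stdint.h>

#if !defined(__x86_64__)
#error "this sketch covers x86_64 only; there is deliberately no fallback"
#endif

/* Hypothetical double-word type: two 64-bit halves, 16-byte aligned
   because cmpxchg16b faults on a misaligned memory operand. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} __attribute__((aligned(16))) dw_t;

/* Hypothetical abstraction-layer entry point.
   If *dest equals *expected, store `desired` in *dest and return 1.
   Otherwise copy the current *dest into *expected and return 0. */
static inline int dwcas(volatile dw_t *dest, dw_t *expected, dw_t desired)
{
    unsigned char swapped;

    /* Catch misaligned destinations in debug builds. */
    assert(((uintptr_t)dest & 15u) == 0);

    __asm__ __volatile__(
        "lock cmpxchg16b %1\n\t"  /* compare rdx:rax with *dest       */
        "setz %0"                 /* ZF set means the swap took place */
        : "=q"(swapped), "+m"(*dest),
          "+a"(expected->lo), "+d"(expected->hi)  /* rax, rdx */
        : "b"(desired.lo), "c"(desired.hi)        /* rbx, rcx */
        : "cc", "memory");

    return swapped;
}
```

Because the instruction is written out by hand, the result does not
depend on -mcx16, on libatomic or on ifuncs; if the processor lacks
cmpxchg16b the program faults rather than quietly taking a lock.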

I may be wrong, but I think it is clear to the reader that this is the
sensible choice for me.  It's a small amount of work, with testable
code, rather than the vague and ongoing task of keeping informed about
the state of libatomic implementation, platform support and bugs, and of
course the possibly unsolvable problem of knowing on an end-user build
whether or not libatomic is emitting a lock-free instruction or
something else.

Of course, I have to keep informed about the state of the GCC
implementation of __sync and __atomic, but they are less complex, since
they have no external dependency and if they are unsupported I know it,
as they offer no alternatives.  I can continue to use the knowledge I
have built up about these APIs and not have to build up a second set of
knowledge in parallel about libatomic.