Background: The following overload of a more general template is the
most performance-critical code in my large project.
typedef unsigned int uint;

template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
#pragma ivdep
    while (++k < n) s[ c[k] ] -= m * v[k];
}
1) I'm focusing on x86_64 for this performance work. All timing info
was captured at consistent, instruction-level detail using OProfile.
2) There is absolutely no aliasing (s, v and c don't overlap, and no
two values in c[] are equal), but I don't know a way to make gcc aware
of that. ???
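The closest I have come up with is casting to __restrict__-qualified
locals inside foo(); as far as I can tell this only expresses that s, c
and v don't overlap each other, not that no two values in c[] are equal,
and I haven't verified that gcc 4.3.2 actually generates better code
from it:

// Sketch only: __restrict__ promises gcc that stores through sp cannot
// alias loads through cp or vp.  It says nothing about duplicate
// indices inside c[] itself.
template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
    double*       __restrict__ sp = (double*)&s[0];
    uint const*   __restrict__ cp = (uint const*)&c[0];
    double const* __restrict__ vp = (double const*)&v[0];
    while (++k < n) sp[ cp[k] ] -= m * vp[k];
}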
3) It only goes through this loop a few times per call, but each caller
of this function inlines it into a loop that runs many times, and this
function is most of the work per iteration of that outer loop. All of
k, n, m, s, c, and v are already in registers at the point this function
is inlined. Immediately after the call, k goes out of scope (per
iteration of the outer loop) and m is clobbered, so it shouldn't matter
that my asm code (below) modifies those two registers.
4) Ts, Tc and Tv are various types at the various points the call is
made, but Ts is always something that can be cast to double*, and
similarly Tc to uint const* and Tv to double const*. If I put those
concrete types in the function declaration, the compiler won't make the
casts: the exact-match instantiation of the more general template wins
over an overload that needs a conversion, so the general template gets
selected. Do you know a cleaner solution to that ??? (Not my main
topic, but I'd like to know that too.)
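The only workaround I have thought of so far is to keep the general
template as a thin forwarder that does the casts and calls a concrete
worker, so every Ts/Tc/Tv combination funnels into one function
(foo_impl is just a name I made up for this sketch):

// Concrete worker: this is where the performance-critical loop lives.
inline void foo_impl(uint k, uint n, double const& m,
                     double* s, uint const* c, double const* v) {
    while (++k < n) s[ c[k] ] -= m * v[k];
}

// General template: only casts and forwards, so it stays trivially
// inlinable and all instantiations reach the same worker.
template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
    foo_impl(k, n, m, (double*)&s[0], (uint const*)&c[0], (double const*)&v[0]);
}

Is there something cleaner than that?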
5) It generates decent code with Intel10 and much slower code with
gcc 4.3.2. All Intel10 does with the ivdep is a 2x unwind of the loop.
It does that with very messy code, and I find it quite surprising that
such messy code runs much faster than what gcc produces and nearly as
fast as my hand-written asm.
6) I wrote many different versions of the code with a manual 2x unwind,
some using ordinary doubles and some using __attribute__ ((vector_size
(16))). With ordinary doubles (and the manual 2x unwind), the resulting
code from gcc was cleaner, smaller and SLOWER than the Intel10 original
(but much faster than the non-unwound gcc code). With vector_size, the
result was even slower than the original gcc code.
7) My problem with vector_size was that I could find no way to split two
doubles cleanly out of a pair. In asm you can use movsd and movhpd to
move a double between memory and half of an xmm register. GCC used
those to move doubles into xmm, but I could find no syntax that made it
use them to move doubles out. All the examples I found online used a
union, and anything I tried with a union caused gcc to use a stack
location as an intermediary for every transfer to and from the register
used for that union. Those stack memory accesses were so slow (far
slower than I'd expect) that the whole routine was hopeless. So another
question is: staying just in C++, how do you split doubles out of a pair
of doubles defined using vector_size ???
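Are the SSE2 intrinsics in <emmintrin.h> the intended way to do this?
As far as I can tell _mm_storel_pd / _mm_storeh_pd map onto exactly the
movsd/movlpd/movhpd forms I want, without a union, but I haven't checked
what gcc 4.3.2 makes of them or whether a cast is needed between my
vector_size(16) type and __m128d:

#include <emmintrin.h>

// Sketch: write the two halves of an xmm register to two unrelated
// addresses without going through a union or the stack.
inline void store_pair(double* lo_dst, double* hi_dst, __m128d t)
{
    _mm_storel_pd(lo_dst, t);  // store the low double  (movsd/movlpd form)
    _mm_storeh_pd(hi_dst, t);  // store the high double (movhpd form)
}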
8) I tried using asm just for the movsd and movhpd to split doubles out
of a pair. The results were better than anything I could get from gcc
without asm, but still a little worse than the original Intel10 code,
which doesn't use any parallel operations.
Here is the pseudo code on which my full asm version is based. It isn't
too far from some of the things I tried in C++ with various forms of a
double_2 class (though of course no form of double_2 would make the
syntax as simple as this pseudo code). It should be clear that the
operation matches the simple while loop at the top of this post:
double_2 m2 = { m, m };
for (;;)
{
    k += 2;
    if ( k < n )
    {
        uint c0 = c[k-1L];
        uint c1 = c[k];
        double_2 x = { v[k-1L], v[k] };
        double_2 t = { s[c0], s[c1] };
        x *= m2;
        t -= x;
        { s[c0], s[c1] } = t;
    }
    else
    {
        if ( k == n )
        {
            uint c0 = c[k-1L];
            m *= v[k-1L];
            s[c0] -= m;
        }
        return;
    }
}
The results described in (8) are for the above code with an appropriate
definition of double_2 and with the non-C++ line
{ s[c0], s[c1] } = t;
replaced by asm code.
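For reference, here is the same pseudo code written entirely with the
<emmintrin.h> intrinsics instead of asm. This is only a sketch of the
mapping; I haven't measured it, and it assumes s, c and v can be used
like plain pointers:

#include <emmintrin.h>

template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
    __m128d m2 = _mm_set1_pd(m);                           // m2 = { m, m };
    for (;;) {
        k += 2;
        if (k < n) {
            uint c0 = c[k-1];
            uint c1 = c[k];
            __m128d x = _mm_loadl_pd(_mm_setzero_pd(), &v[k-1]);
            x         = _mm_loadh_pd(x, &v[k]);            // x = { v[k-1], v[k] };
            __m128d t = _mm_loadl_pd(_mm_setzero_pd(), &s[c0]);
            t         = _mm_loadh_pd(t, &s[c1]);           // t = { s[c0], s[c1] };
            t = _mm_sub_pd(t, _mm_mul_pd(x, m2));          // t -= x * m2;
            _mm_storel_pd(&s[c0], t);                      // s[c0] = t.0;
            _mm_storeh_pd(&s[c1], t);                      // s[c1] = t.1;
        } else {
            if (k == n) s[ c[k-1] ] -= m * v[k-1];
            return;
        }
    }
}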
That finally brings me to the inline asm that I want to ask my main
questions about:
template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
unsigned long c0,c1,k2;
double x,t,m2;
__asm__ volatile ("\
add $2,%k[k] # k += 2; \n\
unpcklpd %[m],%[m] # m2 = { m, m }; \n\
jmp 2f\n\
.align 16\n\
1:\n\
mov -4(%[c],%[k],4),%k[c0] # c0 = c[k-1L]; \n\
mov (%[c],%[k],4),%k[c1] # c1 = c[k]; \n\
movsd -8(%[v],%[k],8),%[x] # x.0 = v[k-1L]; \n\
movhpd (%[v],%[k],8),%[x] # x.1 = v[k]; \n\
movsd (%[s],%[c0],8),%[t] # t.0 = s[c0]; \n\
movhpd (%[s],%[c1],8),%[t] # t.1 = s[c1]; \n\
mulpd %[m],%[x] # x *= m2; \n\
add $2,%k[k] # k += 2; \n\
subpd %[x],%[t] # t -= x; \n\
movsd %[t],(%[s],%[c0],8) # s[c0] = t.0; \n\
movhpd %[t],(%[s],%[c1],8) # s[c1] = t.1; \n\
2:\n\
cmp %k[k],%[n] # if ( k < n ) \n\
ja 1b\n\
jne 3f # if ( k == n ) \n\
mov -4(%[c],%[k],4),%k[c0] # c0 = c[k-1L]; \n\
mulsd -8(%[v],%[k],8),%[m] # m2.0 *= v[k-1L]; \n\
movsd (%[s],%[c0],8),%[t] # t.0 = s[c0]; \n\
subsd %[m],%[t] # t.0 -= m.0; \n\
movsd %[t],(%[s],%[c0],8) # s[c0] = t.0 \n\
3:"
: [c0] "=&r" (c0),
[c1] "=&r" (c1),
[x] "=&x" (x),
[t] "=&x" (t),
[m] "=x" (m2),
[k] "=r" (k2)
:"[m]" (m),
"[k]" ((unsigned long)k),
[n] "r" (n),
[c] "r" ((uint const*)&c[0]),
[v] "r" ((double const*)&v[0]),
[s] "r" ((double*)&s[0])
: "cc" );
}
The result is significantly smaller and cleaner-looking than anything
produced by either compiler, and it runs modestly faster than the
original ugly Intel10-generated code. Obviously, there are significant
rules about instruction choice, sequence and placement for performance
about which I have no clue; otherwise the above code would run much
faster than the long ugly loop generated by Intel10.
9) I never figured out a reasonable way to indicate that the output goes
to s[various]. If I add a clobber for all of memory, the outer loop gets
a little worse, because gcc can then no longer keep one value each from
c[] and v[] in registers untouched across this whole loop. In theory a
caller could keep an element of s[] in a register and go wrong because I
didn't make it an output; in practice they don't. Is there a better
way ??? Does this interact with (2) ???
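Would expressing s[] as a memory operand over the whole array be the
right way, instead of a blanket "memory" clobber? Newer gcc manuals show
casting the pointer to an array of unknown bound for this; I haven't
verified that gcc 4.3.2 accepts it. A toy example of the form I mean
(not the real loop):

// Toy example: the asm stores to s[c0]; the "+m" operand over
// *(double (*)[]) s tells gcc that anything reachable through s may be
// read and written, without clobbering all of memory.
inline void store_one(double* s, unsigned long c0, double x)
{
    __asm__ ("movsd %[x],(%[s],%[c0],8)"
             : "+m" (*(double (*)[]) s)
             : [s] "r" (s), [c0] "r" (c0), [x] "x" (x));
}

In the real loop, c[] and v[] could then stay as plain "m" inputs of the
same form rather than being covered by a clobber.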
10) k and m are both inputs that this asm code clobbers. I couldn't find
or understand any doc on the right way to declare clobbered inputs, so I
declared them as outputs and then declared inputs that are forced into
the same registers as those outputs. Does that work ??? Is there a
better way ??? The way this function is currently called, I didn't
actually need to tell gcc those two registers were clobbered: k goes out
of scope immediately in the caller and m gets clobbered in the caller.
Initially (by mistake, actually) I only declared m as shown above and
made k a simple input. The code worked, and no extra instruction was
generated by gcc for the m vs. m2 duality. Then I fixed the code to
declare k as shown above. GCC added one extra mov instruction to move k,
which was already in a register, to the register it had selected for k2.
That would be correct if k were used again after this block, but it
isn't, so the extra instruction is a waste. How can I do that better ???
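Is the "+" (read-write) constraint the cleaner spelling of what I did
with the matched output/input pairs? A toy example of what I mean:

// Toy example: 'a' starts as an input and the asm leaves a different
// value in the same register; "+r" declares it read-write, which seems
// to be the usual way to say "this input gets clobbered".
inline unsigned long consume(unsigned long a)
{
    __asm__ ("add %[a],%[a]   # clobbers the old value of a"
             : [a] "+r" (a));
    return a;
}

My guess is that the extra mov for k comes from the (unsigned long)k
temporary: gcc has to materialize that widened 64-bit value in whatever
register it chose for k2, and it doesn't realize k itself is dead
afterwards. But that is only a guess.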
11) I'm not sure how you are supposed to declare whether an output
register may overlap the inputs. The "=&r" and "=&x" (early-clobber)
constraints seem to work. Is that the best way ???
12) I found the doc telling how to override a register size down to
32 bits:
%k[c0]
gives you the 32-bit version of a 64-bit register [c0]. But I couldn't
find the doc for the reverse override, and I may need that in other asm
code. What would I do if I wanted to use the 64-bit version of the
32-bit register [n] ???
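The reverse override seems to be %q (with %w and %b for 16 and 8 bits),
i.e. %q[n] prints the 64-bit register name. I couldn't find it clearly
documented for gcc 4.3 either, so treat this as an assumption. A toy
example:

// Toy example: i arrives as a 32-bit operand; the 32-bit self-mov
// zero-extends it, and %q[i] then names the full 64-bit register so it
// can be used as an index.
inline double load_indexed(double const* p, unsigned int i)
{
    double r;
    __asm__ ("mov   %k[i],%k[i]          # zero-extend i to 64 bits\n\t"
             "movsd (%[p],%q[i],8),%[r]  # r = p[i]"
             : [r] "=x" (r), [i] "+r" (i)
             : [p] "r" (p));
    return r;
}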
13) In this (fastest so far) version of the code, the slowest
instruction is the first memory read after c0 = c[k-1L];. I have
rearranged the instruction sequence a few ways without changing that
fact and without changing the total time measurably. I assume
c0 = c[k-1L]; usually causes a cache miss, which stalls the next memory
read, but I don't understand such things in any detail. How do you put
in a preload of a cache line, and is it likely to be worth it ???
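On the preload question: gcc has __builtin_prefetch(addr, rw, locality),
which emits a prefetch without any asm. Whether it would help here I
can't judge; the distance below (16 elements) is a guess, not a tuned
value, and the s[] prefetch needs a real read of c[k+1], so it is only
safe while k+1 is still in range:

// Sketch only: issue prefetches some distance ahead of the current
// iteration.  The constants are untuned guesses.
inline void prefetch_ahead(double* s, unsigned const* c,
                           double const* v, unsigned long k)
{
    __builtin_prefetch(&c[k + 16], 0, 1);      // future index data (read)
    __builtin_prefetch(&v[k + 16], 0, 1);      // future v data (read)
    __builtin_prefetch(&s[ c[k + 1] ], 1, 1);  // s[] line for the next step
                                               // (c[k + 1] is a real load!)
}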
14) v is 16-byte aligned, but the initial value of k is 50% likely to be
odd, so I think combining the two sequential reads v[k-1] and v[k] into
one 128-bit read would likely do more harm than good. Correct ??? If
this inner loop ran for many iterations, I would start by testing
whether k was odd and, if not, do one scalar step first so that k is odd
(k-1 even). But the loop with many iterations is the one immediately
wrapping this loop. I wouldn't even have guessed (prior to seeing the
results of the Intel10 auto 2x unwind of this loop) that a 2x unwind
would be worth it. Any estimate ??? A few pairs of 64-bit loads combined
into 128-bit loads vs. another unpredictable branch and some extra code
polluting the L1 code cache ???
15) The worst memory accesses are to c[k-1] and c[k]. c is also 16-byte
aligned, but I don't know a good way to split two 32-bit unsigned ints
in either a general or an xmm register into two zero-extended unsigned
longs, so I can't see a way to benefit from combining those two reads.
Any ideas ???
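The only idea I have is one 64-bit load covering c[k-1] and c[k], then
splitting it in integer registers: a 32-bit mov zero-extends into the
full register, and a right shift by 32 leaves the other index
zero-extended. In C++ that would be something like the sketch below
(little-endian assumed; I haven't measured whether it actually beats two
separate 32-bit loads):

#include <cstring>

// Sketch: fetch c[k-1] (low half) and c[k] (high half) with a single
// 64-bit load, then split into two zero-extended 64-bit indices.
// gcc normally turns the 8-byte memcpy into one mov.
inline void load_index_pair(unsigned const* c, unsigned long k,
                            unsigned long& c0, unsigned long& c1)
{
    unsigned long both;
    std::memcpy(&both, &c[k-1], sizeof both);
    c0 = (unsigned)both;   // low 32 bits, zero-extended
    c1 = both >> 32;       // high 32 bits, zero-extended
}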
16) movsd vs. movlpd ??? When moving to or from the low double of an
xmm register, in most cases either of those two instructions would do
the job. The subtle behavior difference rarely matters to correct
operation. But I think the performance difference does. I think it got
faster when I replaced my original movlpd with movsd. But performance
sampling is a complex art and that difference wasn't clear enough for me
to conclude it was real. Do you know some good theory on which to base
that choice ???