Background: The following overload of a more general template is the
most performance-critical code in my large project.
typedef unsigned int uint;

template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
#pragma ivdep
    while (++k < n) s[ c[k] ] -= m * v[k];
}
1) I'm focusing on x86_64 for this performance work. All timing info
was captured at consistent, instruction-level detail using OProfile.
2) There is absolutely no aliasing (s, v and c don't overlap, and no
two values in c[] are equal), but I don't know a way to make gcc aware
of that. ???
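The closest I have come up with is casting to __restrict__-qualified
locals inside foo(); as far as I can tell this only expresses that s, c
and v don't overlap each other, not that no two values in c[] are equal,
and I haven't verified that gcc 4.3.2 actually generates better code
from it:

// Sketch only: __restrict__ promises gcc that stores through sp cannot
// alias loads through cp or vp.  It says nothing about duplicate
// indices inside c[] itself.
template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
    double*       __restrict__ sp = (double*)&s[0];
    uint const*   __restrict__ cp = (uint const*)&c[0];
    double const* __restrict__ vp = (double const*)&v[0];
    while (++k < n) sp[ cp[k] ] -= m * vp[k];
}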
3) It only goes through this loop a few times per call, but each caller
of this function inlines it into a loop that runs many times, and this
function is most of the work per iteration of that outer loop. All of
k, n, m, s, c, and v are already in registers at the point this function
is inlined. Immediately after the call, k goes out of scope (per
iteration of the outer loop) and m is clobbered, so it shouldn't matter
that my asm code (below) modifies those two registers.
4) Ts, Tc and Tv are various types at the various points the call is
made, but Ts is always something that can be cast to double*, and
similarly Tc to uint const* and Tv to double const*. If I put those
concrete types in the function declaration, the compiler won't make the
casts: the exact-match instantiation of the more general template wins
over an overload that needs a conversion, so the general template gets
selected. Do you know a cleaner solution to that ??? (Not my main
topic, but I'd like to know that too.)
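The only workaround I have thought of so far is to keep the general
template as a thin forwarder that does the casts and calls a concrete
worker, so every Ts/Tc/Tv combination funnels into one function
(foo_impl is just a name I made up for this sketch):

// Concrete worker: this is where the performance-critical loop lives.
inline void foo_impl(uint k, uint n, double const& m,
                     double* s, uint const* c, double const* v) {
    while (++k < n) s[ c[k] ] -= m * v[k];
}

// General template: only casts and forwards, so it stays trivially
// inlinable and all instantiations reach the same worker.
template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
    foo_impl(k, n, m, (double*)&s[0], (uint const*)&c[0], (double const*)&v[0]);
}

Is there something cleaner than that?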
5) It generates decent code with Intel10 and much slower code with
gcc 4.3.2. All Intel10 does with the ivdep is a 2x unwind of the loop.
It does that with very messy code, and I find it quite surprising that
such messy code runs much faster than what gcc produces and nearly as
fast as my hand-written asm.
6) I wrote many different versions of the code with a manual 2x unwind,
some using ordinary doubles and some using __attribute__ ((vector_size
(16))). With ordinary doubles (and the manual 2x unwind), the resulting
code from gcc was cleaner, smaller and SLOWER than the Intel10 original
(but much faster than the non-unwound gcc code). With vector_size, the
result was even slower than the original gcc code.
7) My problem with vector_size was that I could find no way to split two
doubles cleanly out of a pair. In asm you can use movsd and movhpd to
move a double between memory and half of an xmm register. GCC used
those to move doubles into xmm, but I could find no syntax that made it
use them to move doubles out. All the examples I found online used a
union, and anything I tried with a union caused gcc to use a stack
location as an intermediary for every transfer to and from the register
used for that union. Those stack memory accesses were so slow (far
slower than I'd expect) that the whole routine was hopeless. So another
question is: staying just in C++, how do you split doubles out of a pair
of doubles defined using vector_size ???
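Are the SSE2 intrinsics in <emmintrin.h> the intended way to do this?
As far as I can tell _mm_storel_pd / _mm_storeh_pd map onto exactly the
movsd/movlpd/movhpd forms I want, without a union, but I haven't checked
what gcc 4.3.2 makes of them or whether a cast is needed between my
vector_size(16) type and __m128d:

#include <emmintrin.h>

// Sketch: write the two halves of an xmm register to two unrelated
// addresses without going through a union or the stack.
inline void store_pair(double* lo_dst, double* hi_dst, __m128d t)
{
    _mm_storel_pd(lo_dst, t);  // store the low double  (movsd/movlpd form)
    _mm_storeh_pd(hi_dst, t);  // store the high double (movhpd form)
}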
8) I tried using asm just for the movsd and movhpd to split doubles out
of a pair. The results were better than anything I could get from gcc
without asm, but still a little worse than the original Intel10 code,
which doesn't use any parallel operations.
Here is the pseudo code on which my full asm version is based. It isn't
too far from some of the things I tried in C++ with various forms of a
double_2 class (though of course no form of double_2 would make the
syntax as simple as this pseudo code). It should be clear that the
operation matches the simple while loop at the top of this post:
double_2 m2 = { m, m };
for (;;)
{
    k += 2;
    if ( k < n )
    {
        uint c0 = c[k-1L];
        uint c1 = c[k];
        double_2 x = { v[k-1L], v[k] };
        double_2 t = { s[c0], s[c1] };
        x *= m2;
        t -= x;
        { s[c0], s[c1] } = t;
    }
    else
    {
        if ( k == n )
        {
            uint c0 = c[k-1L];
            m *= v[k-1L];
            s[c0] -= m;
        }
        return;
    }
}
The results described in (8) are for the above code with an appropriate
definition of double_2 and with the non-C++ line
{ s[c0], s[c1] } = t;
replaced by asm code.
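For reference, here is the same pseudo code written entirely with the
<emmintrin.h> intrinsics instead of asm. This is only a sketch of the
mapping; I haven't measured it, and it assumes s, c and v can be used
like plain pointers:

#include <emmintrin.h>

template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
    __m128d m2 = _mm_set1_pd(m);                           // m2 = { m, m };
    for (;;) {
        k += 2;
        if (k < n) {
            uint c0 = c[k-1];
            uint c1 = c[k];
            __m128d x = _mm_loadl_pd(_mm_setzero_pd(), &v[k-1]);
            x         = _mm_loadh_pd(x, &v[k]);            // x = { v[k-1], v[k] };
            __m128d t = _mm_loadl_pd(_mm_setzero_pd(), &s[c0]);
            t         = _mm_loadh_pd(t, &s[c1]);           // t = { s[c0], s[c1] };
            t = _mm_sub_pd(t, _mm_mul_pd(x, m2));          // t -= x * m2;
            _mm_storel_pd(&s[c0], t);                      // s[c0] = t.0;
            _mm_storeh_pd(&s[c1], t);                      // s[c1] = t.1;
        } else {
            if (k == n) s[ c[k-1] ] -= m * v[k-1];
            return;
        }
    }
}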
That finally brings me to the inline asm that I want to ask my main
questions about:
template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
unsigned long c0,c1,k2;
double x,t,m2;
__asm__ volatile ("\
add $2,%k[k] # k += 2; \n\
unpcklpd %[m],%[m] # m2 = { m, m }; \n\
jmp 2f\n\
.align 16\n\
1:\n\
mov -4(%[c],%[k],4),%k[c0] # c0 = c[k-1L]; \n\
mov (%[c],%[k],4),%k[c1] # c1 = c[k]; \n\
movsd -8(%[v],%[k],8),%[x] # x.0 = v[k-1L]; \n\
movhpd (%[v],%[k],8),%[x] # x.1 = v[k]; \n\
movsd (%[s],%[c0],8),%[t] # t.0 = s[c0]; \n\
movhpd (%[s],%[c1],8),%[t] # t.1 = s[c1]; \n\
mulpd %[m],%[x] # x *= m2; \n\
add $2,%k[k] # k += 2; \n\
subpd %[x],%[t] # t -= x; \n\
movsd %[t],(%[s],%[c0],8) # s[c0] = t.0; \n\
movhpd %[t],(%[s],%[c1],8) # s[c1] = t.1; \n\
2:\n\
cmp %k[k],%[n] # if ( k < n ) \n\
ja 1b\n\
jne 3f # if ( k == n ) \n\
mov -4(%[c],%[k],4),%k[c0] # c0 = c[k-1L]; \n\
mulsd -8(%[v],%[k],8),%[m] # m2.0 *= v[k-1L]; \n\
movsd (%[s],%[c0],8),%[t] # t.0 = s[c0]; \n\
subsd %[m],%[t] # t.0 -= m.0; \n\
movsd %[t],(%[s],%[c0],8) # s[c0] = t.0 \n\
3:"
: [c0] "=&r" (c0),
[c1] "=&r" (c1),
[x] "=&x" (x),
[t] "=&x" (t),
[m] "=x" (m2),
[k] "=r" (k2)
:"[m]" (m),
"[k]" ((unsigned long)k),
[n] "r" (n),
[c] "r" ((uint const*)&c[0]),
[v] "r" ((double const*)&v[0]),
[s] "r" ((double*)&s[0])
: "cc" );
}
The result is significantly smaller and cleaner-looking than anything
produced by either compiler, and it runs modestly faster than the
original ugly Intel10-generated code. Obviously, there are significant
rules about instruction choice, sequence and placement for performance
about which I have no clue; otherwise the above code would run much
faster than the long ugly loop generated by Intel10.
9) I never figured out a reasonable way to indicate that the output goes
to s[various]. If I add a clobber for all of memory, the outer loop gets
a little worse, because gcc can then no longer keep one value each from
c[] and v[] in registers untouched across this whole loop. In theory a
caller could keep an element of s[] in a register and go wrong because I
didn't make it an output; in practice they don't. Is there a better
way ??? Does this interact with (2) ???
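Would expressing s[] as a memory operand over the whole array be the
right way, instead of a blanket "memory" clobber? Newer gcc manuals show
casting the pointer to an array of unknown bound for this; I haven't
verified that gcc 4.3.2 accepts it. A toy example of the form I mean
(not the real loop):

// Toy example: the asm stores to s[c0]; the "+m" operand over
// *(double (*)[]) s tells gcc that anything reachable through s may be
// read and written, without clobbering all of memory.
inline void store_one(double* s, unsigned long c0, double x)
{
    __asm__ ("movsd %[x],(%[s],%[c0],8)"
             : "+m" (*(double (*)[]) s)
             : [s] "r" (s), [c0] "r" (c0), [x] "x" (x));
}

In the real loop, c[] and v[] could then stay as plain "m" inputs of the
same form rather than being covered by a clobber.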
10) k and m are both inputs that this asm code clobbers. I couldn't find
or understand any doc on the right way to declare clobbered inputs, so I
declared them as outputs and then declared inputs that are forced into
the same registers as those outputs. Does that work ??? Is there a
better way ??? The way this function is currently called, I didn't
actually need to tell gcc those two registers were clobbered: k goes out
of scope immediately in the caller and m gets clobbered in the caller.
Initially (by mistake, actually) I only declared m as shown above and
made k a simple input. The code worked, and no extra instruction was
generated by gcc for the m vs. m2 duality. Then I fixed the code to
declare k as shown above. GCC added one extra mov instruction to move k,
which was already in a register, to the register it had selected for k2.
That would be correct if k were used again after this block, but it
isn't, so the extra instruction is a waste. How can I do that better ???
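Is the "+" (read-write) constraint the cleaner spelling of what I did
with the matched output/input pairs? A toy example of what I mean:

// Toy example: 'a' starts as an input and the asm leaves a different
// value in the same register; "+r" declares it read-write, which seems
// to be the usual way to say "this input gets clobbered".
inline unsigned long consume(unsigned long a)
{
    __asm__ ("add %[a],%[a]   # clobbers the old value of a"
             : [a] "+r" (a));
    return a;
}

My guess is that the extra mov for k comes from the (unsigned long)k
temporary: gcc has to materialize that widened 64-bit value in whatever
register it chose for k2, and it doesn't realize k itself is dead
afterwards. But that is only a guess.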
11) I'm not sure how you are supposed to declare whether an output
register may overlap the inputs. The "=&r" and "=&x" (early-clobber)
constraints seem to work. Is that the best way ???
12) I found the doc telling how to override a register size down to
32 bits:
%k[c0]
gives you the 32-bit version of a 64-bit register [c0]. But I couldn't
find the doc for the reverse override, and I may need that in other asm
code. What would I do if I wanted to use the 64-bit version of the
32-bit register [n] ???
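The reverse override seems to be %q (with %w and %b for 16 and 8 bits),
i.e. %q[n] prints the 64-bit register name. I couldn't find it clearly
documented for gcc 4.3 either, so treat this as an assumption. A toy
example:

// Toy example: i arrives as a 32-bit operand; the 32-bit self-mov
// zero-extends it, and %q[i] then names the full 64-bit register so it
// can be used as an index.
inline double load_indexed(double const* p, unsigned int i)
{
    double r;
    __asm__ ("mov   %k[i],%k[i]          # zero-extend i to 64 bits\n\t"
             "movsd (%[p],%q[i],8),%[r]  # r = p[i]"
             : [r] "=x" (r), [i] "+r" (i)
             : [p] "r" (p));
    return r;
}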
13) In this (fastest so far) version of the code, the slowest
instruction is the first memory read after c0 = c[k-1L];. I have
rearranged the instruction sequence a few ways without changing that
fact and without changing the total time measurably. I assume
c0 = c[k-1L]; usually causes a cache miss, which stalls the next memory
read, but I don't understand such things in any detail. How do you put
in a preload of a cache line, and is it likely to be worth it ???
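On the preload question: gcc has __builtin_prefetch(addr, rw, locality),
which emits a prefetch without any asm. Whether it would help here I
can't judge; the distance below (16 elements) is a guess, not a tuned
value, and the s[] prefetch needs a real read of c[k+1], so it is only
safe while k+1 is still in range:

// Sketch only: issue prefetches some distance ahead of the current
// iteration.  The constants are untuned guesses.
inline void prefetch_ahead(double* s, unsigned const* c,
                           double const* v, unsigned long k)
{
    __builtin_prefetch(&c[k + 16], 0, 1);      // future index data (read)
    __builtin_prefetch(&v[k + 16], 0, 1);      // future v data (read)
    __builtin_prefetch(&s[ c[k + 1] ], 1, 1);  // s[] line for the next step
                                               // (c[k + 1] is a real load!)
}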
14) v is 16-byte aligned, but the initial value of k is 50% likely to be
odd, so I think combining the two sequential reads v[k-1] and v[k] into
one 128-bit read would likely do more harm than good. Correct ??? If
this inner loop ran for many iterations, I would start by testing
whether k was odd and, if not, do one scalar step first so that k is odd
(k-1 even). But the loop with many iterations is the one immediately
wrapping this loop. I wouldn't even have guessed (prior to seeing the
results of the Intel10 auto 2x unwind of this loop) that a 2x unwind
would be worth it. Any estimate ??? A few pairs of 64-bit loads combined
into 128-bit loads vs. another unpredictable branch and some extra code
polluting the L1 code cache ???
15) The worst memory accesses are to c[k-1] and c[k]. c is also 16-byte
aligned, but I don't know a good way to split two 32-bit unsigned ints
in either a general or an xmm register into two zero-extended unsigned
longs, so I can't see a way to benefit from combining those two reads.
Any ideas ???
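The only idea I have is one 64-bit load covering c[k-1] and c[k], then
splitting it in integer registers: a 32-bit mov zero-extends into the
full register, and a right shift by 32 leaves the other index
zero-extended. In C++ that would be something like the sketch below
(little-endian assumed; I haven't measured whether it actually beats two
separate 32-bit loads):

#include <cstring>

// Sketch: fetch c[k-1] (low half) and c[k] (high half) with a single
// 64-bit load, then split into two zero-extended 64-bit indices.
// gcc normally turns the 8-byte memcpy into one mov.
inline void load_index_pair(unsigned const* c, unsigned long k,
                            unsigned long& c0, unsigned long& c1)
{
    unsigned long both;
    std::memcpy(&both, &c[k-1], sizeof both);
    c0 = (unsigned)both;   // low 32 bits, zero-extended
    c1 = both >> 32;       // high 32 bits, zero-extended
}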
16) movsd vs. movlpd ??? When moving to or from the low double of an
xmm register, in most cases either of those two instructions would do
the job. The subtle behavior difference rarely matters to correct
operation. But I think the performance difference does. I think it got
faster when I replaced my original movlpd with movsd. But performance
sampling is a complex art and that difference wasn't clear enough for me
to conclude it was real. Do you know some good theory on which to base
that choice ???