Help with extended asm syntax and related topics

Background: The following overload of a more general template is the most performance-significant code in my large project.

typedef unsigned int uint;

template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
#pragma ivdep
    while (++k < n) s[ c[k] ] -= m * v[k];
}

1) I'm focusing on x86_64 for this performance work. All timing information was captured at consistent instruction-level detail using OProfile.

2) There is absolutely no aliasing (s, v, and c don't overlap, and no two values in c[] are equal), but I don't know of any way to make gcc aware of that. ???
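
For the overlap half of that, gcc does understand the __restrict__ qualifier on pointer parameters (a sketch, assuming the concrete pointer-typed variant discussed in (4) below). Note that restrict cannot express the stronger property that the c[k] are pairwise distinct, which is what the vectorizer would really need for s[c[k]]:

   inline void foo(uint k, uint n, double const& m,
                   double* __restrict__ s,
                   uint const* __restrict__ c,
                   double const* __restrict__ v)
   {
       // __restrict__ promises s, c, and v don't alias each other;
       // it says nothing about s[c[i]] vs. s[c[j]] for i != j.
       while (++k < n) s[ c[k] ] -= m * v[k];
   }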

3) Each call only goes through this loop a few times, but every caller inlines this function into an outer loop that iterates many times, and this function is most of the work per iteration of that outer loop. All of k, n, m, s, c, and v are already in registers at the point this function is inlined. Immediately after the call, k goes out of scope (per iteration of the outer loop) and m is clobbered, so it shouldn't matter that my asm code (below) modifies those two registers.

4) Ts, Tc, and Tv are various types at the various points the call is made, but Ts is always something that can be cast to double*, and similarly Tc to uint const* and Tv to double const*. If I put those pointer types in the function declaration, though, the compiler won't make the cast; instead it selects the more general template. Do you know a cleaner solution to that ??? (Not my main topic, but I'd like to know that too.)
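
One workaround that comes to mind (a sketch; foo_ptr is a name invented here): keep the general template as a thin forwarder that performs the casts itself, so overload resolution never has to choose:

   inline void foo_ptr(uint k, uint n, double const& m,
                       double* s, uint const* c, double const* v)
   {
       while (++k < n) s[ c[k] ] -= m * v[k];
   }

   template <class Ts, class Tc, class Tv>
   inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v)
   {
       // the casts the compiler wouldn't make for us, made explicitly
       foo_ptr(k, n, m, (double*)&s[0], (uint const*)&c[0],
               (double const*)&v[0]);
   }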

5) Intel 10 generates decent code for it; gcc 4.3.2 generates much slower code. All Intel 10 does with the ivdep is a 2x unroll of the loop, and it does that with quite messy code. I find it surprising that such messy code runs much faster than what gcc produces and nearly as fast as my hand-written asm.

6) I wrote many different versions of the code with a manual 2x unroll, some using ordinary doubles and some using __attribute__ ((vector_size (16))). With ordinary doubles (and the manual 2x unroll), the resulting gcc code was cleaner, smaller, and SLOWER than the original Intel 10 code (though much faster than the non-unrolled gcc code). With vector_size, the result was even slower than gcc's original code.

7) My problem with vector_size was that I could find no way to split the two doubles cleanly out of a pair. In asm you can use movsd and movhpd to move a double between memory and half of an xmm register. GCC used those to move doubles into an xmm register, but I could find no syntax that made it use them to move doubles out. All the examples I found online used a union, and anything I tried with a union caused gcc to use a stack location as an intermediary for all transfers to and from the register holding that union. Those stack accesses were so slow (far slower than I'd expect) that the whole routine was hopeless. So another question: staying just in C++, how do you split the doubles out of a pair defined with vector_size ???
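
One union-free route is the SSE2 intrinsics from <emmintrin.h>, which map directly onto the movsd/movlpd/movhpd forms (a sketch; split_store is a name invented here):

   #include <emmintrin.h>

   typedef double double_2 __attribute__ ((vector_size (16)));

   // Store the two halves of a double_2 without a union or a stack
   // round trip: _mm_storel_pd/_mm_storeh_pd compile to movlpd (or
   // movsd) and movhpd respectively.
   inline void split_store(double_2 t, double* lo, double* hi)
   {
       __m128d tv = (__m128d)t;   // same 16-byte layout; no code emitted
       _mm_storel_pd(lo, tv);     // *lo = t.0
       _mm_storeh_pd(hi, tv);     // *hi = t.1
   }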

8) I tried using asm just for the movsd and movhpd that split the doubles out of a pair. The results were better than anything I could get from gcc without asm, but still a little worse than the original Intel 10 code, which doesn't use any parallel operations.

Here is the pseudo code on which my full asm version is based. It isn't too far from some of the things I tried in C++ with various forms of a double_2 class (though of course no form of double_2 would make the syntax as simple as this pseudo code). It should be clear that the operation matches the simple while loop at the top of this post:

   double_2 m2 = { m, m };
   for (;;)
   {
       k += 2;
       if ( k < n )
       {
           uint c0 = c[k-1L];
           uint c1 = c[k];
           double_2 x = { v[k-1L], v[k] };
           double_2 t = { s[c0], s[c1] };
           x *= m2;
           t -= x;
           { s[c0], s[c1] } = t;
       }
       else
       {
           if ( k == n )
           {
               uint c0 = c[k-1L];
               m *= v[k-1L];
               s[c0] -= m;
           }
           return;
       }
   }

The results described in (8) are for the above code, with an appropriate definition of double_2 and with the non-C++ line

   { s[c0], s[c1] } = t;

replaced by asm code.
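
For concreteness, that replacement can be as small as this (a sketch, assuming double_2 is the vector_size type, which gcc passes in an "x" register):

   __asm__ ("movsd  %[t],%[lo]   # s[c0] = t.0; \n\
        movhpd %[t],%[hi]        # s[c1] = t.1;"
       : [lo] "=m" (s[c0]), [hi] "=m" (s[c1])
       : [t] "x" (t));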

That finally brings me to the inline asm I want to ask my main questions about:

template <class Ts, class Tc, class Tv>
inline void foo(uint k, uint n, double const& m, Ts s, Tc c, Tv v) {
   unsigned long c0,c1,k2;
   double x,t,m2;
   __asm__ volatile ("\
       add $2,%k[k]                 # k += 2; \n\
       unpcklpd %[m],%[m]           # m2 = { m, m }; \n\
       jmp 2f\n\
       .align 16\n\
1:\n\
       mov -4(%[c],%[k],4),%k[c0]   # c0 = c[k-1L]; \n\
       mov (%[c],%[k],4),%k[c1]     # c1 = c[k]; \n\
       movsd -8(%[v],%[k],8),%[x]   # x.0 = v[k-1L]; \n\
       movhpd (%[v],%[k],8),%[x]    # x.1 = v[k]; \n\
       movsd (%[s],%[c0],8),%[t]    # t.0 = s[c0]; \n\
       movhpd (%[s],%[c1],8),%[t]   # t.1 = s[c1]; \n\
       mulpd %[m],%[x]              # x *= m2; \n\
       add $2,%k[k]                 # k += 2; \n\
       subpd %[x],%[t]              # t -= x; \n\
       movsd %[t],(%[s],%[c0],8)    # s[c0] = t.0; \n\
        movhpd %[t],(%[s],%[c1],8)   # s[c1] = t.1; \n\
2:\n\
       cmp %k[k],%[n]               # if ( k < n ) \n\
       ja 1b\n\
       jne 3f                       # if ( k == n ) \n\
       mov -4(%[c],%[k],4),%k[c0]   # c0 = c[k-1L]; \n\
       mulsd -8(%[v],%[k],8),%[m]   # m2.0 *= v[k-1L]; \n\
       movsd (%[s],%[c0],8),%[t]    # t.0 = s[c0]; \n\
       subsd %[m],%[t]              # t.0 -= m.0; \n\
        movsd %[t],(%[s],%[c0],8)    # s[c0] = t.0; \n\
3:"
       : [c0] "=&r" (c0),
         [c1] "=&r" (c1),
         [x]  "=&x" (x),
         [t]  "=&x" (t),
         [m]  "=x" (m2),
         [k]  "=r" (k2)
       :"[m]"    (m),
        "[k]"    ((unsigned long)k),
         [n] "r" (n),
         [c] "r" ((uint const*)&c[0]),
         [v] "r" ((double const*)&v[0]),
         [s] "r" ((double*)&s[0])
       : "cc" );
}

The result is significantly smaller and cleaner-looking than anything produced by either compiler, and it runs modestly faster than the original ugly Intel 10 code. Obviously there are significant rules about instruction choice, sequencing, and placement that matter for performance, about which I have no clue; otherwise the above code would run much faster than the long ugly loop Intel 10 generates.

9) I never figured out a reasonable way to indicate that the outputs go to s[various indices]. If I add a clobber for all of memory, the outer loop gets a little worse, because gcc can no longer keep one value each from c[] and v[] alive in registers across this whole loop. In theory a caller could cache an element of s[] in a register and crash because I didn't declare it as an output; in practice none do. Is there a better way ??? Does this interact with (2) ???
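
One middle ground I believe the gcc manual sanctions (a sketch; S_SPAN is an assumed compile-time upper bound on the s[] indices appearing in c[]): cast s to a pointer-to-sized-array and name that slice as a read-write memory operand, so only s[] is considered changed and gcc can still keep values from c[] and v[] in registers:

   enum { S_SPAN = 4096 };           // assumption: bound on values in c[]
   __asm__ volatile ("..."           /* asm body unchanged */
       : /* register outputs as before, plus: */
         "+m" (*(double (*)[S_SPAN]) (double*)&s[0])  /* s[] may change */
       : [c] "r" ((uint const*)&c[0]),
         [v] "r" ((double const*)&v[0]),
         [s] "r" ((double*)&s[0])
       : "cc" );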

10) k and m are both inputs clobbered by this asm code. I couldn't find (or couldn't understand) any documentation on the right way to declare clobbered inputs, so I declared them as outputs and then declared inputs forced into the same registers as those outputs. Does that work ??? Is there a better way ??? The way this function is currently called, I didn't actually need to tell gcc those two registers were clobbered: k goes out of scope immediately in the caller, and m gets clobbered in the caller. Initially (by mistake, actually) I only declared m as shown above and made k a simple input. The code worked, and gcc generated no extra instruction for the m vs. m2 duality. Then I fixed the code to declare k as shown above, and gcc added one extra mov to copy k, which was already in a register, into the register it had selected for k2. That would be correct if k were used again after this block, but it isn't, so the extra instruction is a waste. How can I do that better ???
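
For what it's worth, the construct I believe the manual intends for a clobbered input is the read-write "+" constraint, which is shorthand for exactly the output-plus-matching-input pairing used above. A sketch:

   unsigned long k2 = k;              // widen once; gcc sees the asm eat it
   double m2 = m;
   __asm__ volatile ("..."            /* asm body unchanged */
       : [k] "+r" (k2), [m] "+x" (m2) /* inputs the asm clobbers */
         /* , the other outputs as before */
       : [n] "r" (n)
       : "cc" );

The extra mov you saw most likely comes from the uint -> unsigned long widening copy into k2 rather than from the constraints; when k is dead afterward gcc can in principle coalesce the two registers, but evidently doesn't always.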

11) I'm not sure how you're supposed to declare that an output register must not overlap the inputs. The "=&r" and "=&x" (early-clobber) constraints seem to work. Is that the best way ???

12) I found the doc telling how to override a register name down to 32 bits:
%k[c0]
gives you the 32-bit version of the 64-bit register holding [c0]. But I couldn't find doc for the reverse override, and I may need that in other asm code. What would I do if I wanted to use the 64-bit version of the 32-bit register [n] ???
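
I believe the reverse modifier is %q (with %w and %b for the 16- and 8-bit names), though as far as I know it is documented only in gcc's i386 back end. A sketch; note the modifier only renames the register, so this is safe only because gcc keeps a 32-bit value zero-extended in practice:

   unsigned long n64;
   __asm__ ("mov %q[n],%[n64]    # %q[n]: 64-bit name of [n]'s register"
       : [n64] "=r" (n64)
       : [n]   "r"  (n));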

13) In this (fastest so far) version of the code, the slowest instruction is the first memory read after c0 = c[k-1L];. I have rearranged the instruction sequence a few ways without changing that fact and without measurably changing the total time. I assume c0 = c[k-1L]; usually causes a cache miss, which stalls the next memory read; I don't understand such things in any detail. How do you put in a preload of a cache line, and is it likely to be worth it ???
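
At the C++ level the hook is gcc's __builtin_prefetch, which becomes a prefetcht* instruction on x86-64 (a sketch on the original scalar loop; DIST is a pure tuning guess, and with only a few iterations per call the prefetch may well be wasted):

   enum { DIST = 16 };                // how far ahead: an assumption
   while (++k < n) {
       if (k + DIST < n)
           __builtin_prefetch(&c[k + DIST], /*rw=*/0, /*locality=*/1);
       s[ c[k] ] -= m * v[k];
   }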

14) v is 16-byte aligned, but the initial value of k is 50% likely to be odd, so I think combining the two sequential reads v[k-1] and v[k] into one 128-bit read would likely do more harm than good. Correct ??? If this inner loop ran for many iterations, I would start by testing whether k was odd and, if not, do one scalar step first so that k is odd (k-1 even). But the loop with many iterations is the one immediately wrapping this one. Prior to seeing the results of Intel 10's automatic 2x unroll of this loop, I wouldn't even have guessed the unroll would be worth it. Any estimate ??? A few pairs of 64-bit loads combined into 128-bit loads vs. another unpredictable branch and some extra code polluting the L1 code cache ???
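
For reference, the peel itself would be tiny (a C++-level sketch; by the estimate being asked for above, it may well not pay for its extra branch here):

   // If the first index the loop will touch (k+1) is odd, do one
   // scalar step first, so every later pair (v[j-1], v[j]) starts at
   // an even index and an aligned 16-byte load becomes legal.
   if (((k + 1) & 1) && k + 1 < n) {
       ++k;
       s[ c[k] ] -= m * v[k];
   }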

15) The worst memory accesses are to c[k-1] and c[k]. c is also 16-byte aligned, but I don't know a good way to split two 32-bit unsigned ints, in either a general or an xmm register, into two zero-extended unsigned longs, so I can't see a way to benefit from combining those two reads. Any ideas ???
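
For the narrow mechanics of the split, x86-64's implicit zero extension of 32-bit register writes makes it cheap in the integer registers (a sketch in the style of the loop body above; [c01] is an extra 64-bit scratch operand). One 8-byte load replaces the two 4-byte loads, at the cost of two dependent ALU ops:

   mov -4(%[c],%[k],4),%[c01]   # c01 = { c[k-1L] , c[k] } in one load
   mov %k[c01],%k[c0]           # 32-bit mov zero-extends: c0 = c[k-1L]
   shr $32,%[c01]               # now c01 = c[k], zero extended

After the shr, [c01] itself serves as c1. Whether trading a load for two ALU ops wins here is exactly the sort of thing that would need measuring.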

16) movsd vs. movlpd ??? When moving to or from the low double of an xmm register, in most cases either instruction does the job. The subtle behavioral difference rarely matters for correctness, but I think the performance difference does: I think the code got faster when I replaced my original movlpd with movsd. Performance sampling is a complex art, though, and that difference wasn't clear enough for me to conclude it was real. Do you know some good theory on which to base that choice ???








