Re: Fwd: GCC Optimization for expressions

Tim Prince <TimothyPrince@xxxxxxxxxxxxx> · Sun, 30 Nov 2008 11:35:45 -0800

Nirav Shah wrote:

> I am having a complicated mathematical expression to compute. What
> would be a good way to compute it?
> 1. Should I write a single statement with many mathematical operations
> in same statement? or
> 2. Fragment the expression and compute one operation in each step and
> storing partial results in local variables and using them in
> subsequent operations?
> 
> Readability or maintainability is not an issue for me. I want
> performance as high as possible, in terms of less clock cycles, low
> cache misses and low memory access. Even few clock cycle save or few
> less cache misses or memory access would be big benefits to me.
> 
If a partial result will be used more than once, it's often helpful to
store it in a local automatic variable.  gcc's optimizer can usually
recognize common subexpressions, provided they match starting from the
left: e1 + f1 + d1 is evaluated as (e1 + f1) + d1, so it shares e1 + f1
with other expressions but not f1 + d1.
Parenthesizing and factoring out common terms by hand may be important to
minimize the number of repeated operations, as in the following rewrite of
a segment of the public netlib vectors benchmark:

  a1 *= d1 *(e1 + f1)+ e1 * f1 + c1 *(e1 + f1 + d1)
              + b1 *(e1 + f1 + d1 + c1);
  b1 *= d1 *(e1 + f1)+ e1 * f1 + c1 *(e1 + f1 + d1);
  c1 *= d1 *(e1 + f1)+ e1 * f1;
  d1 *= e1 * f1;
  s[i__ - 1] = a1 * b1 * c1 * d1;
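
As a rough sketch of the local-variable approach, the shared partial sums
in the segment above could be hoisted by hand as follows.  The function
names and the temporaries ef, def, tc, tb and ta are hypothetical, chosen
only for illustration; the grouping keeps the same left-to-right
evaluation order as the written-out version.

```c
/* Hypothetical helper: the update from the segment above, with the
   shared partial sums held in local temporaries. */
static double update_with_temps(double a1, double b1, double c1,
                                double d1, double e1, double f1)
{
    double ef  = e1 + f1;               /* appears in every multiplier */
    double def = ef + d1;               /* e1 + f1 + d1                */
    double tc  = d1 * ef + e1 * f1;     /* multiplier for c1           */
    double tb  = tc + c1 * def;         /* multiplier for b1           */
    double ta  = tb + b1 * (def + c1);  /* multiplier for a1           */

    a1 *= ta;
    b1 *= tb;
    c1 *= tc;
    d1 *= e1 * f1;
    return a1 * b1 * c1 * d1;
}

/* The same update written out in full, for comparison. */
static double update_as_written(double a1, double b1, double c1,
                                double d1, double e1, double f1)
{
    a1 *= d1 * (e1 + f1) + e1 * f1 + c1 * (e1 + f1 + d1)
          + b1 * (e1 + f1 + d1 + c1);
    b1 *= d1 * (e1 + f1) + e1 * f1 + c1 * (e1 + f1 + d1);
    c1 *= d1 * (e1 + f1) + e1 * f1;
    d1 *= e1 * f1;
    return a1 * b1 * c1 * d1;
}
```

Calling both with the same arguments, for instance (1, 2, 3, 4, 5, 6),
and comparing the generated assembly is one way to see whether the
temporaries buy anything over what the optimizer already does.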

gfortran is permitted to extract common factors this way, but it will
recognize only a few of them from the original version.  The C standard
doesn't permit the compiler to factor floating-point expressions on its
own, since re-association can change the rounded result; gcc does so only
under relaxed options such as -ffast-math.
The expressions to the right of c1 *=  and b1 *= would be the first
candidates for assignment to additional local temporaries, but gcc's
common-subexpression optimization should handle the code as written.
Examining the assembly produced by gcc -S, or by objdump -S *.o, should
help confirm this.
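
To illustrate why a C compiler may not regroup floating-point terms on
its own, here is a minimal sketch showing that the grouping of a sum can
change the result.  The function names are hypothetical, and the volatile
temporaries are only there to force each partial sum to be rounded to
double.

```c
/* Sum grouped from the left: (a + b) + c. */
static double sum_left(double a, double b, double c)
{
    volatile double t = a + b;  /* volatile: round the partial sum to double */
    return t + c;
}

/* Same values grouped from the right: a + (b + c). */
static double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e20, b = -1e20 and c = 1.0, sum_left returns 1.0 while
sum_right returns 0.0, because the 1.0 is lost when it is first added to
-1e20.  A compiler that regrouped the sum silently would change the
answer.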

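Horner's rule minimizes the number of operations required for polynomial
evaluation, but forces more serialization than necessary: each step must
wait for the result of the previous one.  Thus the following suggested
implementation of tanh(), which groups terms by powers of x^2 and x^4 so
that several partial sums can be computed independently: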

__inline_mathcodeNP (tanh, __x, \
  if(__fabsl(__x) <= .34657){           \
        long double __x2 = __x * __x;   \
        long double __x4 = __x2*__x2;   \
          return  __x + __x2*__x*(      \
 -0.3333333333333333333028L             \
  +__x2*(0.133333333333333321200L       \
 +__x2*-0.5396825396825207695E-01L      \
  +__x4*(0.218694885360028124E-01L      \
 +__x2*-0.88632355226515778E-02         \
  +__x4*(0.3592127817609080E-02         \
 +__x2*-0.14558300258105E-02)           \
  +__x4*__x4*(0.5899693119329E-03       \
 +__x2*-0.238614526828E-03              \
  +__x4*(0.9399418484E-04               \
 +__x2*-0.294863013E-04)))));}          \
 else return 1 - 2 / (expl(__x + __x) + 1))
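
The trade-off is easier to see on a small polynomial.  The sketch below
compares Horner's rule with an Estrin-style pairing of terms for degree
7; it is not the exact grouping used in the macro above, and the function
names are hypothetical.  The grouped form spends two extra multiplies but
its four pair sums do not depend on one another.

```c
#include <stddef.h>

/* Horner's rule: degree multiplies and degree adds, but every step
   needs the previous result, so the chain is fully serial. */
static double poly_horner(const double *c, size_t degree, double x)
{
    double r = c[degree];
    for (size_t i = degree; i-- > 0; )
        r = r * x + c[i];
    return r;
}

/* Degree 7 with terms grouped in pairs: 9 multiplies and 7 adds, but
   p01, p23, p45 and p67 can all be computed independently. */
static double poly7_grouped(const double c[8], double x)
{
    double x2  = x * x;
    double x4  = x2 * x2;
    double p01 = c[0] + c[1] * x;
    double p23 = c[2] + c[3] * x;
    double p45 = c[4] + c[5] * x;
    double p67 = c[6] + c[7] * x;
    return (p01 + x2 * p23) + x4 * (p45 + x2 * p67);
}
```

Both functions evaluate c[0] + c[1]*x + ... + c[7]*x^7; in floating point
the two may differ in the last bits because the rounding order differs.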

This points up an exception to the common factors optimization rule:
x + x2*x should not be written as x*(1 + x2), as this would not save an
operation, and would reduce accuracy for the desirable case where the x
term is larger than the trailing terms.  (x+x) is preferred to x*2, but
optimizers should be capable of making these choices automatically.
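
A small experiment along these lines can make the accuracy point
concrete.  The sketch below evaluates both forms in single precision
against a double-precision reference; the function names, the sampling
grid and the use of a single coefficient p in place of the full
polynomial are all assumptions made for illustration.

```c
/* Preferred form: the dominant term x is added last.
   Two multiplies and one add. */
static float form_sum(float x, float p)
{
    float x2 = x * x;
    return x + x2 * x * p;
}

/* Factored form: also two multiplies and one add, but 1 + x2*p is
   rounded before the final multiply. */
static float form_factored(float x, float p)
{
    float x2 = x * x;
    return x * (1.0f + x2 * p);
}

/* Reference value computed in double precision. */
static double form_reference(float x, float p)
{
    double xd = x;
    return xd + xd * xd * xd * (double)p;
}

/* Largest |form(x) - reference(x)| over n evenly spaced x in (0, xmax]. */
static double max_abs_error(float (*form)(float, float),
                            float p, float xmax, int n)
{
    double worst = 0.0;
    for (int i = 1; i <= n; ++i) {
        float x = xmax * (float)i / (float)n;
        double err = (double)form(x, p) - form_reference(x, p);
        if (err < 0.0)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

Comparing max_abs_error(form_sum, -1.0f/3.0f, 0.34657f, 1000) with the
same call for form_factored should show how much the extra rounding of
1 + x2*p costs on a given machine; the exact figures will depend on the
compiler and floating-point settings.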

Unless your expression is evaluated in a loop with serial dependencies,
this is one situation where cache and memory considerations don't intrude;
the operands and temporaries of a single expression would normally stay in
registers.  Since you raised that question, we may suspect you haven't
shown enough for a useful answer.