Re: Fwd: GCC Optimization for expressions

"Nirav Shah" <nirav4ever4u@xxxxxxxxx> · Mon, 1 Dec 2008 01:37:57 +0530

On Mon, Dec 1, 2008 at 1:05 AM, Tim Prince <TimothyPrince@xxxxxxxxxxxxx> wrote:
> Nirav Shah wrote:
>
>> I am having a complicated mathematical expression to compute. What
>> would be a good way to compute it?
>> 1. Should I write a single statement with many mathematical operations
>> in same statement? or
>> 2. Fragment the expression and compute one operation in each step and
>> storing partial results in local variables and using them in
>> subsequent operations?
>>
>> Readability or maintainability is not an issue for me. I want
>> performance as high as possible, in terms of less clock cycles, low
>> cache misses and low memory access. Even few clock cycle save or few
>> less cache misses or memory access would be big benefits to me.
>>
> If a partial result will be used more than once, it's often helpful to
> store it in a local automatic variable.  gcc optimization usually can
> recognize common subexpressions starting from the left.
> Use of parentheses and common factors may be important to minimize the
> number of repeated operations, as in the following re-write of a segment
> of the public netlib vectors benchmark:
>
>  a1 *= d1 *(e1 + f1)+ e1 * f1 + c1 *(e1 + f1 + d1)
>              + b1 *(e1 + f1 + d1 + c1);
>  b1 *= d1 *(e1 + f1)+ e1 * f1 + c1 *(e1 + f1 + d1);
>  c1 *= d1 *(e1 + f1)+ e1 * f1;
>  d1 *= e1 * f1;
>  s[i__ - 1] = a1 * b1 * c1 * d1;
>
> gfortran is permitted to extract common factors this way, but it will
> recognize only a few of them from the original version.  The C standard
> doesn't permit automatic optimization of common factors.
> The expressions to the right of c1 *=  and b1 *= would be the first
> candidates for assignment to additional local temporaries, but gcc
> optimization should work as written.
> Examination of the code generated by -S or by objdump -S *.o should be
> helpful in diagnosis.
>
> Horner's rule minimizes the number of operations required for polynomial
> evaluation, but forces more serialization than necessary.  Thus the
> following suggested implementation of tanh():
>
> __inline_mathcodeNP (tanh, __x, \
>  if(__fabsl(__x) <= .34657){           \
>        long double __x2 = __x * __x;   \
>        long double __x4 = __x2*__x2;   \
>          return  __x + __x2*__x*(      \
>  -0.3333333333333333333028L             \
>  +__x2*(0.133333333333333321200L       \
>  +__x2*-0.5396825396825207695E-01L      \
>  +__x4*(0.218694885360028124E-01L      \
>  +__x2*-0.88632355226515778E-02         \
>  +__x4*(0.3592127817609080E-02         \
>  +__x2*-0.14558300258105E-02)           \
>  +__x4*__x4*(0.5899693119329E-03       \
>  +__x2*-0.238614526828E-03              \
>  +__x4*(0.9399418484E-04               \
>  +__x2*-0.294863013E-04)))));}          \
>  else return 1 - 2 / (expl(__x + __x) + 1))
>
> This points up an exception to the common factors optimization rule:
> x + x2*x should not be written as x*(1 + x2), as this would not save an
> operation, and would reduce accuracy for the desirable case where the x
> term is larger than the trailing terms.  (x+x) is preferred to x*2, but
> optimizers should be capable of making these choices automatically.
>
> Unless your expression is evaluated in a loop with serial dependencies,
> this is one situation where cache and memory considerations don't intrude.
>  Since you raised that question, we may suspect you haven't shown enough
> for a useful answer.
>
>

Hi Tim,

Thanks for the clarification. And let me open up myself also.

I am working on study of data sharing characteristic of Blackscholes
application of Parsec benchmark. I need to come up with better data
sharing foot print of execution. For this purpose I am experimenting
with the code for my study purpose only. In the original code, while
computing the blackscholes PDE, the author of the code has used many
local variables to store intermediate results of the computation. The
author is making only one operation in a statement. Hence having many
statements and intermediate auto variables in the code. I am trying to
execute entire computation in one or max two statements eliminating
use of extra auto variables.

No expression is evaluated in loop nor any particular intermediate
result is used more than once. Given this, can I get benefit in terms
of lesser clock cycle to execute the computation if done in a single
statement as compared to done over multiple statements with local
variables?

Regards,
Nirav Shah.