On Mon, Dec 1, 2008 at 1:05 AM, Tim Prince <TimothyPrince@xxxxxxxxxxxxx> wrote: > Nirav Shah wrote: > >> I am having a complicated mathematical expression to compute. What >> would be a good way to compute it? >> 1. Should I write a single statement with many mathematical operations >> in same statement? or >> 2. Fragment the expression and compute one operation in each step and >> storing partial results in local variables and using them in >> subsequent operations? >> >> Readability or maintainability is not an issue for me. I want >> performance as high as possible, in terms of less clock cycles, low >> cache misses and low memory access. Even few clock cycle save or few >> less cache misses or memory access would be big benefits to me. >> > If a partial result will be used more than once, it's often helpful to > store it in a local automatic variable. gcc optimization usually can > recognize common subexpressions starting from the left. > Use of parentheses and common factors may be important to minimize the > number of repeated operations, as in the following re-write of a segment > of the public netlib vectors benchmark: > > a1 *= d1 *(e1 + f1)+ e1 * f1 + c1 *(e1 + f1 + d1) > + b1 *(e1 + f1 + d1 + c1); > b1 *= d1 *(e1 + f1)+ e1 * f1 + c1 *(e1 + f1 + d1); > c1 *= d1 *(e1 + f1)+ e1 * f1; > d1 *= e1 * f1; > s[i__ - 1] = a1 * b1 * c1 * d1; > > gfortran is permitted to extract common factors this way, but it will > recognize only a few of them from the original version. The C standard > doesn't permit automatic optimization of common factors. > The expressions to the right of c1 *= and b1 *= would be the first > candidates for assignment to additional local temporaries, but gcc > optimization should work as written. > Examination of the code generated by -S or by objdump -S *.o should be > helpful in diagnosis. > > Horner's rule minimizes the number of operations required for polynomial > evaluation, but forces more serialization than necessary. Thus the > following suggested implementation of tanh(): > > __inline_mathcodeNP (tanh, __x, \ > if(__fabsl(__x) <= .34657){ \ > long double __x2 = __x * __x; \ > long double __x4 = __x2*__x2; \ > return __x + __x2*__x*( \ > -0.3333333333333333333028L \ > +__x2*(0.133333333333333321200L \ > +__x2*-0.5396825396825207695E-01L \ > +__x4*(0.218694885360028124E-01L \ > +__x2*-0.88632355226515778E-02 \ > +__x4*(0.3592127817609080E-02 \ > +__x2*-0.14558300258105E-02) \ > +__x4*__x4*(0.5899693119329E-03 \ > +__x2*-0.238614526828E-03 \ > +__x4*(0.9399418484E-04 \ > +__x2*-0.294863013E-04)))));} \ > else return 1 - 2 / (expl(__x + __x) + 1)) > > This points up an exception to the common factors optimization rule: > x + x2*x should not be written as x*(1 + x2), as this would not save an > operation, and would reduce accuracy for the desirable case where the x > term is larger than the trailing terms. (x+x) is preferred to x*2, but > optimizers should be capable of making these choices automatically. > > Unless your expression is evaluated in a loop with serial dependencies, > this is one situation where cache and memory considerations don't intrude. > Since you raised that question, we may suspect you haven't shown enough > for a useful answer. > > Hi Tim, Thanks for the clarification. And let me open up myself also. I am working on study of data sharing characteristic of Blackscholes application of Parsec benchmark. I need to come up with better data sharing foot print of execution. For this purpose I am experimenting with the code for my study purpose only. In the original code, while computing the blackscholes PDE, the author of the code has used many local variables to store intermediate results of the computation. The author is making only one operation in a statement. Hence having many statements and intermediate auto variables in the code. I am trying to execute entire computation in one or max two statements eliminating use of extra auto variables. No expression is evaluated in loop nor any particular intermediate result is used more than once. Given this, can I get benefit in terms of lesser clock cycle to execute the computation if done in a single statement as compared to done over multiple statements with local variables? Regards, Nirav Shah.