Nirav Shah wrote: > I am having a complicated mathematical expression to compute. What > would be a good way to compute it? > 1. Should I write a single statement with many mathematical operations > in same statement? or > 2. Fragment the expression and compute one operation in each step and > storing partial results in local variables and using them in > subsequent operations? > > Readability or maintainability is not an issue for me. I want > performance as high as possible, in terms of less clock cycles, low > cache misses and low memory access. Even few clock cycle save or few > less cache misses or memory access would be big benefits to me. > If a partial result will be used more than once, it's often helpful to store it in a local automatic variable. gcc optimization usually can recognize common subexpressions starting from the left. Use of parentheses and common factors may be important to minimize the number of repeated operations, as in the following re-write of a segment of the public netlib vectors benchmark: a1 *= d1 *(e1 + f1)+ e1 * f1 + c1 *(e1 + f1 + d1) + b1 *(e1 + f1 + d1 + c1); b1 *= d1 *(e1 + f1)+ e1 * f1 + c1 *(e1 + f1 + d1); c1 *= d1 *(e1 + f1)+ e1 * f1; d1 *= e1 * f1; s[i__ - 1] = a1 * b1 * c1 * d1; gfortran is permitted to extract common factors this way, but it will recognize only a few of them from the original version. The C standard doesn't permit automatic optimization of common factors. The expressions to the right of c1 *= and b1 *= would be the first candidates for assignment to additional local temporaries, but gcc optimization should work as written. Examination of the code generated by -S or by objdump -S *.o should be helpful in diagnosis. Horner's rule minimizes the number of operations required for polynomial evaluation, but forces more serialization than necessary. Thus the following suggested implementation of tanh(): __inline_mathcodeNP (tanh, __x, \ if(__fabsl(__x) <= .34657){ \ long double __x2 = __x * __x; \ long double __x4 = __x2*__x2; \ return __x + __x2*__x*( \ -0.3333333333333333333028L \ +__x2*(0.133333333333333321200L \ +__x2*-0.5396825396825207695E-01L \ +__x4*(0.218694885360028124E-01L \ +__x2*-0.88632355226515778E-02 \ +__x4*(0.3592127817609080E-02 \ +__x2*-0.14558300258105E-02) \ +__x4*__x4*(0.5899693119329E-03 \ +__x2*-0.238614526828E-03 \ +__x4*(0.9399418484E-04 \ +__x2*-0.294863013E-04)))));} \ else return 1 - 2 / (expl(__x + __x) + 1)) This points up an exception to the common factors optimization rule: x + x2*x should not be written as x*(1 + x2), as this would not save an operation, and would reduce accuracy for the desirable case where the x term is larger than the trailing terms. (x+x) is preferred to x*2, but optimizers should be capable of making these choices automatically. Unless your expression is evaluated in a loop with serial dependencies, this is one situation where cache and memory considerations don't intrude. Since you raised that question, we may suspect you haven't shown enough for a useful answer.