One last question, if you would allow me to extend a little bit... Using the code that is at the end of this e-mail, and the following flags: -Ofast -funroll-loops --param max-completely-peeled-insns=10000 --param max-completely-peel-times=10000 The instructions executed by evaluate(...) are less than evaluate_fact(...) as shown by the callgrind output: callgrind.out.13969:fn=(1078) double evaluate_fact<20>(double const*, double const*) callgrind.out.13969-0 25334 callgrind.out.13969- -- callgrind.out.13970:fn=(1078) double evaluate<20>(double const*, double const*) callgrind.out.13970-0 17636 callgrind.out.13970- Why could this be happening? What optimizations could be hindered by applying the distributive law by hand? Thanks in advance to all. Best, -- Astor template<int k> double __attribute__ ((noinline)) evaluate(const double u[], const double phi[]) { double f = 0.; for (int i3 = 0;i3<k;++i3) for (int i2 = 0;i2<k;++i2) for (int i1 = 0;i1<k;++i1) f += u[i1+k*(i2+k*i3)] * phi[i1] * phi[i2] * phi[i3]; return f; } template<int k> double __attribute__ ((noinline)) evaluate_fact(const double u[], const double phi[]) { double f3 = 0.; for (int i3 = 0;i3<k;++i3) { const int k3 = k*i3; double f2 = 0.; for (int i2 = 0;i2<k;++i2) { const int k2 = k*(i2+k3); double f1 = 0.; for (int i1 = 0;i1<k;++i1) { f1 += u[i1+k2] * phi[i1]; } f2 += f1 * phi[i2]; } f3 += f2 * phi[i3]; } return f3; } int main() { const static unsigned int k=20; const static double u[k*k*k] = {3.14} ; const static double phi[k] = {3.14} ; double e = 0. ; #ifndef FACTOR e += evaluate<k>(u, phi); #else e += evaluate_fact<k>(u, phi); #endif return static_cast<int>(e); } On Thu, Jul 27, 2017 at 7:25 AM, Xi Ruoyao <ryxi@xxxxxxxxxxxxxxxxx> wrote: > On 2017-07-25 18:04 +0200, Astor Piaz wrote: >> Hi Alexander, >> >> Thank you, your answer is pretty clear and your example is much better. >> >> Could I ask you though, why wouldn't this be allowed once the user has >> requested -fassociative-math ? > > Alexander said this is not implemented, but it should be allowed. For > integer operands, this is allowed by ISO C/C++ standards. For floating- > point operands, this should be allowed by -fassociative-math. > >> Is there a fundamental problem to provide this optimization or it is >> just too difficult? > > I don't know, but it seems no existing compilers has implemented it yet. > >> Thanks a lot. >> >> Best, >> -- >> Astor >> >> On Tue, Jul 25, 2017 at 5:11 PM, Alexander Monakov <amonakov@xxxxxxxxx> wrote: >> > Hi, >> > >> > I think what you're asking is not the "usual" loop invariant motion, but rather >> > applying distributive law to reductions. In your example you want the optimizer >> > to replace >> > >> > R = 0; >> > for ( ... ) >> > R += X * C; >> > >> > by >> > >> > R = 0; >> > for ( ... ) >> > R += X; >> > R *= C; >> > >> > We don't do that even for integer operands, the following isn't optimized either: >> > (neither do Clang and ICC according to my experiments on gcc.godbolt.org) >> > >> > int f(int *a) >> > { >> > int r = 0; >> > for (int i = 0; i < 1024; i++) >> > r += a[i] * 5; >> > return r; >> > } >> > >> > Alexander > -- > Xi Ruoyao <ryxi@xxxxxxxxxxxxxxxxx> > School of Aerospace Science and Technology, Xidian University