Hi there,

For the simple case below, I think the load of poly1[i] should be hoisted out of the innermost loop into the outer loop, since it is invariant with respect to j, instead of being reloaded on every inner iteration. However, for arm-none-eabi targets such as Cortex-M4, GCC fails to do so. Is this expected behavior or a missed optimization? Please advise.

void PolyMul (float *poly1, unsigned int n1,
              float *poly2, unsigned int n2,
              float *polymul, unsigned int *nmul)
{
  unsigned int i, j;

  for (i = 0; i <= n1; i++)
    for (j = 0; j <= n2; j++)
      polymul[i+j] += poly1[i] * poly2[j];
}

Thanks.

BR,
Terry
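
P.S. To make the expected transformation concrete, here is a rough sketch of the hand-hoisted form. The name PolyMulHoisted is made up for illustration, and the unused nmul parameter is dropped for brevity. Note the assumption baked in: this is only equivalent to the original when polymul does not overlap poly1.

```c
/* Hypothetical hand-hoisted variant (name invented for illustration).
   Assumption: polymul does not overlap poly1; if it did, a store to
   polymul[i + j] could change poly1[i] and the results would differ
   from the original loop.  */
void PolyMulHoisted (const float *poly1, unsigned int n1,
                     const float *poly2, unsigned int n2,
                     float *polymul)
{
  unsigned int i, j;

  for (i = 0; i <= n1; i++)
    {
      float p = poly1[i];   /* loaded once per outer iteration */

      for (j = 0; j <= n2; j++)
        polymul[i + j] += p * poly2[j];
    }
}
```

One possible explanation for the missing hoist is exactly that aliasing question: without further information the compiler may have to assume that a store to polymul[i+j] can modify poly1[i]. If that is the cause, restrict-qualifying the pointers might be another way to allow the hoist, though that is a guess rather than something confirmed here.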