(Adding gcc@xxxxxxxxxxx, which is more relevant.)
After some investigation, I can answer my own question now.
On 11/11/13 16:39, Yufeng Zhang wrote:
Hi,
I'm trying to understand the following code-gen difference in the
following code:
#ifdef BEFORE
typedef int arr_2[4][5];
#else
typedef int arr_2[4][4];
#endif
void foo (arr_2 a2, int i, int j)
{
a2[i + 2][j] = 2;
}
The code-gen using a mainline gcc -O2 -DBEFORE is:
movslq %esi, %rsi
movslq %edx, %rdx
leaq (%rsi,%rsi,4), %rax
leaq 40(%rdi,%rax,4), %rax
movl $2, (%rax,%rdx,4)
ret
and with plain -O2 it is:
movslq %esi, %rsi
movslq %edx, %rdx
addq $2, %rsi
salq $4, %rsi
addq %rdi, %rsi
movl $2, (%rsi,%rdx,4)
ret
When the typedef is arr_2[4][5] (a row of 20 bytes), the compiler
distributes (i + 2) * 20 into i * 20 + 40, while not doing so when the
typedef is arr_2[4][4] (a row of 16 bytes).
pointer_int_sum () in c-family/c-common.c applies the distributive law
to the non-pointer addend if the addend is a MULT_EXPR of the form ((a
+/- c1) * c2), where both c1 and c2 are integer constants. The comment
there claims that "this helps produce common subexpressions".
In the example above, a2[i + 2] is transformed to
a2 + ((i * 16) + 32)
when arr_2 is typedefed as [4][4], and transformed to
a2 + ((i * 20) + 40)
in the case of [4][5].
However, later on fold_binary_loc () in fold-const.c reverts the effort,
with assistance from fold_plusminus_mult_expr (), which tries to
undistribute the plus (or minus). But for fold_plusminus_mult_expr () to
carry out the undistribution, c2 has to be a power of two, which is why
we see the code-gen difference (16 is a power of two, but 20 is not).
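To make the gate concrete: a power-of-two check is the usual single-bit
test. This is only my sketch of the condition (if I read fold-const.c
correctly, the actual check is exact_log2 () on the smaller multiplier),
not the code in fold_plusminus_mult_expr () itself:

#include <assert.h>

/* A non-zero value is a power of two iff it has exactly one bit set.  */
static int power_of_two_p (unsigned long c)
{
  return c != 0 && (c & (c - 1)) == 0;
}

int main (void)
{
  assert (power_of_two_p (16));   /* i * 16 + 32 is refactored to (i + 2) * 16 */
  assert (!power_of_two_p (20));  /* i * 20 + 40 is left distributed */
  return 0;
}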
The -fdump-tree-original dumps already differ, so I guess it is the
parser that does something interesting.
Can anyone enlighten me as to which part of the parser I should look
into further to better understand the code-gen difference?
Regards,
Yufeng
p.s.
diff of the -fdump-tree-original output:
- (*(a2 + ((sizetype) ((long unsigned int) i * 20) + 40)))[j] = 2;
+ (*(a2 + ((sizetype) i + 2) * 16))[j] = 2;
A similar code-gen difference can be observed in the following example code:
void foo (float * a, float * b, int n, int j)
{
*a = b[j+50] + b[j-50];
}
You'll see the following -fdump-tree-original output on x86_64 at -O2:
*a = *(b + ((sizetype) j + 50) * 4) + *(b + ((sizetype) ((long unsigned int) j * 4) + 18446744073709551416));
While b[j+50] is turned into (b + (j + 50) * 4), for b[j-50] the
multiplication has been distributed.
The difference is again caused by fold_plusminus_mult_expr (), which
carries out the undistribution when host_integerp (200, 0) passes, but
quits early when host_integerp (18446744073709551416, 0) fails.
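The magic number falls out of sizetype arithmetic: -200 (i.e. -50 * 4)
wraps around to 2^64 - 200 in the unsigned 64-bit sizetype, which no
longer fits in a signed host integer. A standalone check (not GCC code):

#include <stdio.h>

int main (void)
{
  /* -200 reinterpreted as an unsigned 64-bit value, as sizetype does.  */
  unsigned long long c = (unsigned long long) -200LL;
  printf ("%llu\n", c);  /* prints 18446744073709551416, i.e. 2^64 - 200 */
  return 0;
}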
This looks like a classic example of different parts of the compiler
disagreeing about which heuristic is better.
IMHO, array indexing expressions with immediate offset(s) should be
distributed, so that the immediates can be folded together (when there
are multiple) and re-associated to the rightmost side of the addressing
expression (to better utilize addressing modes with an immediate
offset). For example, given the following code:
typedef struct { int x,y,a,b; } X;
int
foo (X p[][4], int x, int y)
{
return p[x+1][y+1].a;
}
the code-gen on ARM with -mcpu=cortex-a15 -O2 -mthumb is:
add r2, r2, #1
add r1, r1, #1
mov r2, r2, asl #4
add r2, r2, r1, asl #6
add r0, r0, r2
ldr r0, [r0, #8]
which can be optimized to:
add r1, r0, r1, asl #6
add r2, r1, r2, asl #4
ldr r0, [r2, #88]
if the distribution is applied. The code-gen on x86_64 can be improved
in a similar fashion from the current poorer sequence:
movslq %esi, %rsi
addl $1, %edx
addq $1, %rsi
movslq %edx, %rdx
salq $6, %rsi
salq $4, %rdx
addq %rdi, %rsi
movl 8(%rsi,%rdx), %eax
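For reference, the arithmetic behind the folded #88 immediate in the
optimized sequences, assuming the usual layout where int is 4 bytes (so
sizeof (X) is 16, one row of p is 64 bytes, and the member a sits at
byte offset 8); a standalone check:

#include <assert.h>
#include <stddef.h>

typedef struct { int x,y,a,b; } X;

int main (void)
{
  assert (sizeof (X) == 16);       /* one row of p is 4 * 16 = 64 bytes */
  assert (offsetof (X, a) == 8);
  /* (x + 1) * 64 + (y + 1) * 16 + 8  ==  x * 64 + y * 16 + 88,
     so x * 64 and y * 16 can go into shifted-register operands and
     the folded constant 88 into the immediate offset of the load.  */
  for (long x = 0; x < 8; x++)
    for (long y = 0; y < 4; y++)
      assert ((x + 1) * 64 + (y + 1) * 16 + 8 == x * 64 + y * 16 + 88);
  return 0;
}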
I'm currently experimenting with a new tree-ssa pass to apply the
distribution to qualified addends of POINTER_PLUS expressions. I
deliberately avoid changing fold_binary_loc () or
fold_plusminus_mult_expr (), as I think both folding functions are
fairly mature and the folding can be beneficial in other cases.
I'm quite keen to hear what you think about this issue. Any comment is
welcome!
Thanks,
Yufeng