(Adding gcc@xxxxxxxxxxx, which is more relevant.)
After some investigation, I can answer my own question now.
On 11/11/13 16:39, Yufeng Zhang wrote:
Hi,
I'm trying to understand the following code-gen difference in the
following code:
#ifdef BEFORE
typedef int arr_2[4][5];
#else
typedef int arr_2[4][4];
#endif
void foo (arr_2 a2, int i, int j)
{
a2[i + 2][j] = 2;
}
The code-gen using a mainline gcc -O2 -DBEFORE is:
movslq %esi, %rsi
movslq %edx, %rdx
leaq (%rsi,%rsi,4), %rax
leaq 40(%rdi,%rax,4), %rax
movl $2, (%rax,%rdx,4)
ret
and with plain -O2 it is:
movslq %esi, %rsi
movslq %edx, %rdx
addq $2, %rsi
salq $4, %rsi
addq %rdi, %rsi
movl $2, (%rsi,%rdx,4)
ret
When the typedef is arr_2[4][5] (a row of 20 bytes), the compiler
distributes (i + 2) * 20 into i * 20 + 40, while not doing so when the
typedef is arr_2[4][4] (a row of 16 bytes).
pointer_int_sum () in c-family/c-common.c applies the distributive law
to the non-pointer addend if the addend is a MULT_EXPR of the form ((a
+/- c1) * c2), where both c1 and c2 are integer constants. The comment
there claims that "this helps produce common subexpressions".
In the example above, a2[i + 2] is transformed to
a2 + ((i * 16) + 32)
when arr_2 is typedefed as [4][4], and transformed to
a2 + ((i * 20) + 40)
in the case of [4][5].
However, later on fold_binary_loc () in fold-const.c reverts the effort,
with assistance from fold_plusminus_mult_expr (), which tries to
undistribute the plus (or minus). But for fold_plusminus_mult_expr () to
carry out the undistribution, c2 has to be a power of two, which is why
we see the code-gen difference (16 is a power of two, but 20 is not).
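To make the gate concrete: a power-of-two check is the usual single-bit
test. This is only my sketch of the condition (if I read fold-const.c
correctly, the actual check is exact_log2 () on the smaller multiplier),
not the code in fold_plusminus_mult_expr () itself:

#include <assert.h>

/* A non-zero value is a power of two iff it has exactly one bit set.  */
static int power_of_two_p (unsigned long c)
{
  return c != 0 && (c & (c - 1)) == 0;
}

int main (void)
{
  assert (power_of_two_p (16));   /* i * 16 + 32 is refactored to (i + 2) * 16 */
  assert (!power_of_two_p (20));  /* i * 20 + 40 is left distributed */
  return 0;
}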
The -fdump-tree-original dumps already differ, so I guess it is the
parser that does something interesting.
Can anyone enlighten me as to which part of the parser I should look
into further to better understand the code-gen difference?
Regards,
Yufeng
p.s.
diff of the -fdump-tree-original output:
- (*(a2 + ((sizetype) ((long unsigned int) i * 20) + 40)))[j] = 2;
+ (*(a2 + ((sizetype) i + 2) * 16))[j] = 2;
A similar code-gen difference can be observed in the following example code:
void foo (float * a, float * b, int n, int j)
{
*a = b[j+50] + b[j-50];
}
You'll see the following -fdump-tree-original output on x86_64 at -O2:
*a = *(b + ((sizetype) j + 50) * 4) + *(b + ((sizetype) ((long unsigned int) j * 4) + 18446744073709551416));
While b[j+50] is turned into (b + (j + 50) * 4), for b[j-50] the
multiplication has been distributed.
The difference is again caused by fold_plusminus_mult_expr (), which
carries out the undistribution when host_integerp (200, 0) passes, but
quits early when host_integerp (18446744073709551416, 0) fails.
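The magic number falls out of sizetype arithmetic: -200 (i.e. -50 * 4)
wraps around to 2^64 - 200 in the unsigned 64-bit sizetype, which no
longer fits in a signed host integer. A standalone check (not GCC code):

#include <stdio.h>

int main (void)
{
  /* -200 reinterpreted as an unsigned 64-bit value, as sizetype does.  */
  unsigned long long c = (unsigned long long) -200LL;
  printf ("%llu\n", c);  /* prints 18446744073709551416, i.e. 2^64 - 200 */
  return 0;
}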
This looks like a classic example of different parts of the compiler
disagreeing about which heuristic is better.
IMHO, array indexing expressions with immediate offset(s) should be
distributed, so that the immediates can be folded together (when there
are multiple) and re-associated to the rightmost side of the addressing
expression (to better utilize addressing modes with an immediate
offset). For example, given the following code:
typedef struct { int x,y,a,b; } X;
int
foo (X p[][4], int x, int y)
{
return p[x+1][y+1].a;
}
the code-gen on ARM with -mcpu=cortex-a15 -O2 -mthumb is:
add r2, r2, #1
add r1, r1, #1
mov r2, r2, asl #4
add r2, r2, r1, asl #6
add r0, r0, r2
ldr r0, [r0, #8]
which can be optimized to:
add r1, r0, r1, asl #6
add r2, r1, r2, asl #4
ldr r0, [r2, #88]
if the distribution is applied. The code-gen on x86_64 can be improved
in a similar fashion from the current poorer sequence:
movslq %esi, %rsi
addl $1, %edx
addq $1, %rsi
movslq %edx, %rdx
salq $6, %rsi
salq $4, %rdx
addq %rdi, %rsi
movl 8(%rsi,%rdx), %eax
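For reference, the arithmetic behind the folded #88 immediate in the
optimized sequences, assuming the usual layout where int is 4 bytes (so
sizeof (X) is 16, one row of p is 64 bytes, and the member a sits at
byte offset 8); a standalone check:

#include <assert.h>
#include <stddef.h>

typedef struct { int x,y,a,b; } X;

int main (void)
{
  assert (sizeof (X) == 16);       /* one row of p is 4 * 16 = 64 bytes */
  assert (offsetof (X, a) == 8);
  /* (x + 1) * 64 + (y + 1) * 16 + 8  ==  x * 64 + y * 16 + 88,
     so x * 64 and y * 16 can go into shifted-register operands and
     the folded constant 88 into the immediate offset of the load.  */
  for (long x = 0; x < 8; x++)
    for (long y = 0; y < 4; y++)
      assert ((x + 1) * 64 + (y + 1) * 16 + 8 == x * 64 + y * 16 + 88);
  return 0;
}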
I'm currently experimenting with a new tree-ssa pass to apply the
distribution to qualified addends of POINTER_PLUS expressions. I
deliberately avoid changing fold_binary_loc () or
fold_plusminus_mult_expr (), as I think both folding functions are
fairly mature and the folding can be beneficial in other cases.
I'm quite keen to hear what you think about this issue. Any comment is
welcome!
Thanks,
Yufeng