Re: Spurious optimization failures - unnecessary stack frame management

Tom Bachmann <e_mc_h2@xxxxxx> · Tue, 09 Jul 2013 15:33:54 +0200

On 09.07.2013 13:37, Andrew Haley wrote:
On 07/09/2013 11:29 AM, Tom Bachmann wrote:

...the optimizer has to eliminate many temporaries, inline calls,
track pointers etc. It seems to me that, for no apparent reason,
this goes wrong sometimes. For example, in g++-4.6.4 or g++-4.8.1,
both of the above functions yield essentially equal machine code,
with a stack frame size of about 56 bytes. On the other hand,
g++-4.7.3 produces the attached code [NB: this is compiled without
exception support, to simplify comparison to the pure C code]. (I
obtained this via objdump, since I did not find the extra labels etc
produced by g++ -S helpful.) Notice that the stack frame size has
grown to 376 bytes! I have been trying to understand the produced
code, but could not make much sense of it.

It's hard to be precise without analysing your code in detail, but:

As a general rule, x86-64 is very sensitive to register pressure.  It
happens often that what appears to be a minor inlining decision tips
the register allocator over the edge, and we start to need a lot of
spill slots.

I do not think this is what is happening here. Most of the stack frame
is never even initialized. Instead the frame holds a local temporary
object which, for some reason, is not properly decomposed into its
scalar members.
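To make clear what kind of temporary is meant, here is a much reduced
sketch of an expression-template tree. All names in it (Leaf, Sum,
make_sum, eval, add_all) are hypothetical stand-ins, not the types of
the actual library, and plain long values take the place of fmpz. The
point is only that the nested object consists of nothing but pointers
to the operands, so ideally it should never need to exist in memory.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only, not the real library types.
struct Leaf { const long* f; };        // wraps a pointer to one operand

template <class L, class R>
struct Sum { L left; R right; };       // expression node, children stored by value

inline long eval(const Leaf& x) { return *x.f; }

template <class L, class R>
long eval(const Sum<L, R>& s) { return eval(s.left) + eval(s.right); }

template <class L, class R>
Sum<L, R> make_sum(const L& l, const R& r) { return Sum<L, R>{l, r}; }

long add_all(const long& a, const long& b, const long& c, const long& d) {
    Leaf la{&a}, lb{&b}, lc{&c}, ld{&d};
    // Nested temporary that holds only operand pointers. If the optimizer
    // splits it into scalars, no stack storage is needed for it.
    auto e = make_sum(make_sum(make_sum(la, lb), lc), ld);
    return eval(e);
}
```

When the tree is split into its scalar members, add_all should reduce
to three additions on the four arguments. When it is not, the whole
tree has to live in the stack frame.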

An excerpt (annotated by me) of the optimized tree output is this:

  add1_right = &MEM[(const struct fmpz_data &)b_2(D)].f;
  add1_left = &MEM[(const struct fmpz_data &)a_1(D)].f;
  add2_left = &MEM[(const struct fmpz_data &)c_3(D)].f;
  output = &MEM[(fmpz[1] &)out_5(D)][0];

  SR.345_48 = &MEM[(const struct fmpz_data &)d_4(D)].f;
  D.45275.head.D.33631.data.f = add1_left;

  D.45275.tail.head.D.39531.data.head.D.38556.data.head.D.37581.data.head.D.33631.data.f = add1_right;
  D.45275.tail.head.D.39531.data.head.D.38556.data.head.D.37581.data.tail.head.D.36483.data.head.D.33631.data.f = add2_left;
  D.45275.tail.head.D.39531.data.head.D.38556.data.head.D.37581.data.tail.head.D.36483.data.tail.head.D.35310.data.head.D.33631.data.f = add1_left;
  D.45275.tail.head.D.39531.data.head.D.38556.data.head.D.37581.data.tail.head.D.36483.data.tail.head.D.35310.data.tail.head.D.33631.data.f = add1_right;
  D.45275.tail.head.D.39531.data.head.D.38556.data.tail.head.D.33631.data.f = add2_left;
  D.45275.tail.head.D.39531.data.tail.head.D.33631.data.f = SR.345_48;
  MEM[(struct expression *)&D.40621].data = D.45275;
  D.45275 ={v} {CLOBBER};

;; init the temporary
  MEM[(fmpz *)&temps_backing] = 0;

;; find arguments
  add6_left = MEM[(struct fmpzxx_expression *)&D.40621 + 8B].D.33631.data.f;
  add5_right = MEM[(struct fmpzxx_expression *)&D.40621 + 128B].D.33631.data.f;
  add4_right = MEM[(struct fmpzxx_expression *)&D.40621 + 112B].D.33631.data.f;
  add3_left = MEM[(struct fmpzxx_expression *)&D.40621 + 40B].D.33631.data.f;

;; function calls
  fmpz_add (&MEM[(fmpz[1] &)&temps_backing][0], add1_left, add1_right);
  fmpz_add (&MEM[(fmpz[1] &)&temps_backing][0], add2_left, &MEM[(const struct fmpz_data &)&temps_backing].f);
  fmpz_add (&MEM[(fmpz[1] &)&temps_backing][0], add3_left, &MEM[(const struct fmpz_data &)&temps_backing].f);
  fmpz_add (&MEM[(fmpz[1] &)&temps_backing][0], &MEM[(const struct fmpz_data &)&temps_backing].f, add4_right);
  fmpz_add (&MEM[(fmpz[1] &)&temps_backing][0], &MEM[(const struct fmpz_data &)&temps_backing].f, add5_right);
  fmpz_add (output, add6_left, &MEM[(const struct fmpz_data &)&temps_backing].f);

For reasons which are not clear to me, add3_* to add6_* are not
recognised as aliases of add1_* and add2_*, so the intermediate
object is not dispensed with, which yields the big stack frame.
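Reduced to its bare shape, the pattern in the dump is: store pointers
into the fields of one aggregate, copy the whole aggregate into a
second one, then load the same pointers back out of the copy. A
hypothetical sketch of that shape follows. The names Inner, Outer,
sum_via_copy and sum_direct are made up for illustration, and whether
any particular GCC version forwards the stores in a case this small
is an assumption, not something verified here.

```cpp
#include <cassert>

// Hypothetical types for illustration; they only mimic the nesting in the dump.
struct Inner { const long* f; };
struct Outer { Inner head; Inner tail; };

long sum_via_copy(const long* a, const long* b) {
    Outer tmp;
    tmp.head.f = a;                   // cf. D.45275.head...f = add1_left
    tmp.tail.f = b;                   // cf. D.45275.tail...f = add1_right
    Outer copy = tmp;                 // cf. MEM[&D.40621].data = D.45275
    const long* left = copy.head.f;   // cf. add6_left = MEM[&D.40621 + 8B]...
    const long* right = copy.tail.f;  // loaded back, although it is just b
    return *left + *right;
}

// What one would hope the function above reduces to once the loads
// from the copy are recognised as the pointers stored earlier.
long sum_direct(const long* a, const long* b) { return *a + *b; }
```

To compare the two on a given compiler, the GCC options
-fdump-tree-optimized (writes the optimized tree dump) and
-fstack-usage (writes a per-function stack usage file) can be added
to the usual -O2 command line.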

But it is time for the idea that a programmer can write arbitrarily
awful code and just expect the optimizer to sort it all out to die.
Sometimes GCC can do amazing things, and sometimes the tiniest tweak
will mean that your beautifully optimized routine no longer optimizes
so well.  No matter what we do, this will always be so if you push a
compiler to the edge.

It seems like on a practical level I will have to accept this, and 
rethink how my library relies on the optimizer. I really would have 
thought the kind of optimization needed above is performed reliably.

Thanks for your response.

Regards,
Tom