Spurious optimization failures - unnecessary stack frame management

Tom Bachmann <e_mc_h2@xxxxxx> · Tue, 09 Jul 2013 12:29:23 +0200

Hi,

[please CC me, I am not subscribed to this list]

I am writing a C++ expression template wrapper library for FLINT [0]. I 
am finding that across gcc versions, and with no apparent pattern, the 
optimizer sometimes fails to properly eliminate stack frame management. 
Is this a known problem? What parameter values should one increase to 
have the optimizer do this more aggressively?

I am working on x86-64, if that is relevant.

Please excuse my being so vague; unfortunately I do not know much about 
optimizer internals. Let me show you an example. Consider the function

void
test_fmpzxx_asymadd_1 (fmpzxx& out, const fmpzxx& a,
        const fmpzxx& b, const fmpzxx& c, const fmpzxx& d)
{
    out = (a + (((b + (c + (a + b))) + c) + d));
}

The type fmpzxx has a single data member, which is a "long". One may 
obtain a pointer to this data member using the _fmpz() method. Using 
some expression template magic [1], the above line is turned into 
function calls to a C library, essentially equivalent to the following:

void
test_fmpzxx_asymadd_2 (fmpzxx& out, const fmpzxx& a,
        const fmpzxx& b, const fmpzxx& c, const fmpzxx& d)
{
    fmpz_t tmp;
    fmpz_init (tmp);

    fmpz_add (tmp, a._fmpz (), b._fmpz ());
    fmpz_add (tmp, c._fmpz (), tmp);
    fmpz_add (tmp, b._fmpz (), tmp);
    fmpz_add (tmp, tmp, c._fmpz ());
    fmpz_add (tmp, tmp, d._fmpz ());
    fmpz_add (out._fmpz (), a._fmpz (), tmp);

    fmpz_clear (tmp);
}
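
To give a rough idea of the mechanism without pulling in the real 
library, here is a heavily simplified sketch. None of it is the actual 
wrapper code: operand, num, add_expr, eval_into, add and demo are made-up 
names, a plain "long" addition stands in for fmpz_add, and only the 
operand shapes needed for this example are handled (a node whose two 
operands are both sub-expressions would need a second temporary and is 
left out). Unlike the hand-written version above, the sketch copies the 
temporary into the result at the end instead of letting the last 
addition write there directly.

```cpp
#include <cassert>
#include <type_traits>

// All names below are hypothetical stand-ins, not the real wrapper code.
struct operand {};   // tag for types that may appear in an expression

// Node recording "l + r" by reference; nothing is computed when it is built.
template<class L, class R>
struct add_expr : operand
{
    const L& l;
    const R& r;
    add_expr (const L& l_, const R& r_) : l (l_), r (r_) {}
};

static int add_calls = 0;   // counts calls to the stand-in C routine

// Stand-in for fmpz_add: *res = *x + *y; res may alias x or y.
inline void add (long* res, const long* x, const long* y)
{
    ++add_calls;
    *res = *x + *y;
}

// Stand-in for fmpzxx: a single long, plus a pointer accessor.
struct num : operand
{
    long v;
    explicit num (long x = 0) : v (x) {}
    long* ptr () { return &v; }
    const long* ptr () const { return &v; }

    // Assigning an expression evaluates it through one temporary.
    template<class L, class R>
    num& operator= (const add_expr<L, R>& e)
    {
        long tmp = 0;
        eval_into (&tmp, e);   // resolved by argument-dependent lookup
        v = tmp;
        return *this;
    }
};

// Both operands are leaves.
inline void eval_into (long* res, const add_expr<num, num>& e)
{
    add (res, e.l.ptr (), e.r.ptr ());
}

// Left operand is a sub-expression: evaluate it in place, then add the leaf.
template<class L, class R>
void eval_into (long* res, const add_expr<add_expr<L, R>, num>& e)
{
    eval_into (res, e.l);
    add (res, res, e.r.ptr ());
}

// Right operand is a sub-expression: the mirror image of the case above.
template<class L, class R>
void eval_into (long* res, const add_expr<num, add_expr<L, R> >& e)
{
    eval_into (res, e.r);
    add (res, e.l.ptr (), res);
}

// operator+ only builds a node; it is restricted to tagged types.
template<class L, class R>
typename std::enable_if<std::is_base_of<operand, L>::value
                        && std::is_base_of<operand, R>::value,
                        add_expr<L, R> >::type
operator+ (const L& l, const R& r)
{
    return add_expr<L, R> (l, r);
}

// Same expression shape as test_fmpzxx_asymadd_1.
long demo ()
{
    num a (1), b (2), c (3), d (4), out;
    add_calls = 0;
    out = (a + (((b + (c + (a + b))) + c) + d));
    assert (add_calls == 6);   // as many calls as the hand-written version
    return out.v;
}
```

In this sketch each "+" only builds a small object holding two 
references, and all the work happens in the assignment. Those node 
objects are what should disappear entirely from the generated code.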

However, to attain this, the optimizer has to eliminate many 
temporaries, inline calls, track pointers etc. It seems to me that, for 
no apparent reason, this goes wrong sometimes. For example, in g++-4.6.4 
or g++-4.8.1, both of the above functions yield essentially equal 
machine code, with a stack frame size of about 56 bytes. On the other 
hand, g++-4.7.3 produces the attached code [NB: this is compiled without 
exception support, to simplify comparison to the pure C code]. (I 
obtained this via objdump, since I did not find the extra labels etc. 
produced by g++ -S helpful.) Notice that the stack frame size has grown 
to 376 bytes! I have been trying to understand the produced code, but 
could not make much sense of it. Some parts of the stack frame are 
initialized, then copied around, and then other data is used in calling 
the C functions. It seems like the optimizer just stopped arbitrarily, 
presumably because of some heuristic cutoff. My main question is: is 
there a switch to tune this heuristic?

Please note that this problem is not specific to version 4.7.3. There 
are other (similar) examples where e.g. 4.7.3 optimizes just fine, but, 
say, 4.8.1 produces similarly silly code, etc.

Thanks,
Tom

[0] http://www.flintlib.org/
[1] It is a rather big library by now. I am trying to avoid showing the 
relevant C++ code. In particular, all my attempts at isolating a "minimal 
problematic example" have caused the optimizer to kick in before the 
code reached an acceptably small size.

You can find all the code at https://github.com/ness01/flint2/tree/gsoc, 
the functions test_fmpzxx_asymadd_? discussed are found in 
cxx/test/t-codegen.cpp.

[attachment: objdump output for test_fmpzxx_asymadd_1, g++-4.7.3]

0000000000402d80 <test_fmpzxx_asymadd_1>:
  402d80:	48 89 5c 24 d0       	mov    %rbx,-0x30(%rsp)
  402d85:	48 89 6c 24 d8       	mov    %rbp,-0x28(%rsp)
  402d8a:	49 89 f1             	mov    %rsi,%r9
  402d8d:	4c 89 64 24 e0       	mov    %r12,-0x20(%rsp)
  402d92:	4c 89 6c 24 e8       	mov    %r13,-0x18(%rsp)
  402d97:	48 89 fd             	mov    %rdi,%rbp
  402d9a:	4c 89 74 24 f0       	mov    %r14,-0x10(%rsp)
  402d9f:	4c 89 7c 24 f8       	mov    %r15,-0x8(%rsp)
  402da4:	48 81 ec 78 01 00 00 	sub    $0x178,%rsp
  402dab:	4c 89 84 24 88 00 00 	mov    %r8,0x88(%rsp)
  402db2:	00 
  402db3:	48 89 74 24 10       	mov    %rsi,0x10(%rsp)
  402db8:	48 89 cb             	mov    %rcx,%rbx
  402dbb:	48 89 54 24 30       	mov    %rdx,0x30(%rsp)
  402dc0:	48 89 4c 24 40       	mov    %rcx,0x40(%rsp)
  402dc5:	48 8d bc 24 a8 00 00 	lea    0xa8(%rsp),%rdi
  402dcc:	00 
  402dcd:	48 89 74 24 50       	mov    %rsi,0x50(%rsp)
  402dd2:	48 89 54 24 58       	mov    %rdx,0x58(%rsp)
  402dd7:	48 8d 74 24 10       	lea    0x10(%rsp),%rsi
  402ddc:	48 89 4c 24 78       	mov    %rcx,0x78(%rsp)
  402de1:	b9 12 00 00 00       	mov    $0x12,%ecx
  402de6:	48 c7 04 24 00 00 00 	movq   $0x0,(%rsp)
  402ded:	00 
  402dee:	f3 48 a5             	rep movsq %ds:(%rsi),%es:(%rdi)
  402df1:	4c 89 ce             	mov    %r9,%rsi
  402df4:	48 89 e7             	mov    %rsp,%rdi
  402df7:	4c 8b bc 24 c8 00 00 	mov    0xc8(%rsp),%r15
  402dfe:	00 
  402dff:	4c 8b b4 24 10 01 00 	mov    0x110(%rsp),%r14
  402e06:	00 
  402e07:	4c 8b a4 24 a8 00 00 	mov    0xa8(%rsp),%r12
  402e0e:	00 
  402e0f:	4c 8b ac 24 20 01 00 	mov    0x120(%rsp),%r13
  402e16:	00 
  402e17:	e8 24 ee ff ff       	callq  401c40 <fmpz_add@plt>
  402e1c:	48 89 e2             	mov    %rsp,%rdx
  402e1f:	48 89 de             	mov    %rbx,%rsi
  402e22:	48 89 e7             	mov    %rsp,%rdi
  402e25:	e8 16 ee ff ff       	callq  401c40 <fmpz_add@plt>
  402e2a:	48 89 e2             	mov    %rsp,%rdx
  402e2d:	4c 89 fe             	mov    %r15,%rsi
  402e30:	48 89 e7             	mov    %rsp,%rdi
  402e33:	e8 08 ee ff ff       	callq  401c40 <fmpz_add@plt>
  402e38:	4c 89 f2             	mov    %r14,%rdx
  402e3b:	48 89 e6             	mov    %rsp,%rsi
  402e3e:	48 89 e7             	mov    %rsp,%rdi
  402e41:	e8 fa ed ff ff       	callq  401c40 <fmpz_add@plt>
  402e46:	4c 89 ea             	mov    %r13,%rdx
  402e49:	48 89 e6             	mov    %rsp,%rsi
  402e4c:	48 89 e7             	mov    %rsp,%rdi
  402e4f:	e8 ec ed ff ff       	callq  401c40 <fmpz_add@plt>
  402e54:	48 89 ef             	mov    %rbp,%rdi
  402e57:	48 89 e2             	mov    %rsp,%rdx
  402e5a:	4c 89 e6             	mov    %r12,%rsi
  402e5d:	e8 de ed ff ff       	callq  401c40 <fmpz_add@plt>
  402e62:	48 8b 3c 24          	mov    (%rsp),%rdi
  402e66:	48 89 f8             	mov    %rdi,%rax
  402e69:	48 c1 f8 3e          	sar    $0x3e,%rax
  402e6d:	48 83 f8 01          	cmp    $0x1,%rax
  402e71:	74 3d                	je     402eb0 <test_fmpzxx_asymadd_1+0x130>
  402e73:	48 8b 9c 24 48 01 00 	mov    0x148(%rsp),%rbx
  402e7a:	00 
  402e7b:	48 8b ac 24 50 01 00 	mov    0x150(%rsp),%rbp
  402e82:	00 
  402e83:	4c 8b a4 24 58 01 00 	mov    0x158(%rsp),%r12
  402e8a:	00 
  402e8b:	4c 8b ac 24 60 01 00 	mov    0x160(%rsp),%r13
  402e92:	00 
  402e93:	4c 8b b4 24 68 01 00 	mov    0x168(%rsp),%r14
  402e9a:	00 
  402e9b:	4c 8b bc 24 70 01 00 	mov    0x170(%rsp),%r15
  402ea2:	00 
  402ea3:	48 81 c4 78 01 00 00 	add    $0x178,%rsp
  402eaa:	c3                   	retq   
  402eab:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  402eb0:	e8 eb ed ff ff       	callq  401ca0 <_fmpz_clear_mpz@plt>
  402eb5:	eb bc                	jmp    402e73 <test_fmpzxx_asymadd_1+0xf3>
  402eb7:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
  402ebe:	00 00