Hi,
[please CC me, I am not subscribed to this list]
I am writing a C++ expression template wrapper library for FLINT [0]. I
am finding that, across gcc versions and with no apparent pattern, the
optimizer sometimes fails to eliminate temporaries and leaves behind a
much larger stack frame than necessary. Is this a known problem? Which
parameter values should I increase to make the optimizer do this more
aggressively?
I am working on x86-64, if that is relevant.
Please excuse my being so vague; unfortunately I do not know much about
optimizer internals. Let me show you an example. Consider the function:
void
test_fmpzxx_asymadd_1 (fmpzxx& out, const fmpzxx& a,
                       const fmpzxx& b, const fmpzxx& c, const fmpzxx& d)
{
  out = (a + (((b + (c + (a + b))) + c) + d));
}
The type fmpzxx has a single data member, which is a "long". One may
obtain a pointer to this data member using the _fmpz() method. Using
some expression template magic [1], the above line is turned into
function calls to a C library, essentially equivalent to the following:
void
test_fmpzxx_asymadd_2 (fmpzxx& out, const fmpzxx& a,
                       const fmpzxx& b, const fmpzxx& c, const fmpzxx& d)
{
  fmpz_t tmp;
  fmpz_init (tmp);
  fmpz_add (tmp, a._fmpz (), b._fmpz ());
  fmpz_add (tmp, c._fmpz (), tmp);
  fmpz_add (tmp, b._fmpz (), tmp);
  fmpz_add (tmp, tmp, c._fmpz ());
  fmpz_add (tmp, tmp, d._fmpz ());
  fmpz_add (out._fmpz (), a._fmpz (), tmp);
  fmpz_clear (tmp);
}
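
For readers unfamiliar with the technique, here is a minimal,
self-contained sketch of how an expression such as the one above can be
turned into a chain of C-style calls. It is not the wrapper library's
actual code: the names num, sum, operand, c_add and demo_asymadd are made
up for illustration, c_add merely stands in for fmpz_add, and the sketch
uses one scratch slot per nesting level rather than a single shared
temporary.

```cpp
#include <cassert>

// Hypothetical stand-in for the C library call (fmpz_add in the real code).
static int c_add_calls = 0;

static void c_add(long* out, const long* x, const long* y)
{
    assert(out && x && y);
    ++c_add_calls;
    *out = *x + *y;
}

template <class L, class R> struct sum;

// Hypothetical wrapper type with a single "long" data member.
struct num
{
    long value;
    explicit num(long v = 0) : value(v) {}
    const long* ptr() const { return &value; }

    // Assigning an expression is what triggers the actual evaluation.
    template <class L, class R>
    num& operator=(const sum<L, R>& e)
    {
        long result = 0;
        e.eval_into(&result);  // evaluate first, so the target may occur in e
        value = result;
        return *this;
    }
};

// Expression node: records its operands by reference, computes nothing yet.
template <class L, class R>
struct sum
{
    const L& lhs;
    const R& rhs;
    void eval_into(long* out) const;
};

// A leaf already has storage, so just hand out its address.
inline const long* operand(const num& n, long*) { return n.ptr(); }

// A nested sum is evaluated into scratch space provided by the caller.
template <class L, class R>
const long* operand(const sum<L, R>& e, long* scratch)
{
    e.eval_into(scratch);
    return scratch;
}

template <class L, class R>
void sum<L, R>::eval_into(long* out) const
{
    long s1 = 0, s2 = 0;
    c_add(out, operand(lhs, &s1), operand(rhs, &s2));
}

// operator+ only builds the expression tree.
inline sum<num, num> operator+(const num& a, const num& b) { return {a, b}; }

template <class L, class R>
sum<sum<L, R>, num> operator+(const sum<L, R>& a, const num& b) { return {a, b}; }

template <class L, class R>
sum<num, sum<L, R> > operator+(const num& a, const sum<L, R>& b) { return {a, b}; }

template <class L1, class R1, class L2, class R2>
sum<sum<L1, R1>, sum<L2, R2> >
operator+(const sum<L1, R1>& a, const sum<L2, R2>& b) { return {a, b}; }

// The expression from the example, on small values.
long demo_asymadd()
{
    num a(1), b(2), c(3), d(4), out;
    out = a + (((b + (c + (a + b))) + c) + d);
    return out.value;
}
```

Every operator+ here only returns a small struct of references, and all
the work happens in operator=. To end up with straight-line calls, the
compiler has to inline the whole chain and see through those references,
which is the kind of work I am relying on the optimizer for.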
However, to attain this, the optimizer has to eliminate many
temporaries, inline calls, track pointers etc. It seems to me that, for
no apparent reason, this goes wrong sometimes. For example, in g++-4.6.4
or g++-4.8.1, both of the above functions yield essentially equal
machine code, with a stack frame size of about 56 bytes. On the other
hand, g++-4.7.3 produces the attached code [NB: this is compiled without
exception support, to simplify comparison to the pure C code]. (I
obtained this via objdump, since I did not find the extra labels etc.
produced by g++ -S helpful.) Notice that the stack frame size has grown
to 376 bytes! I have been trying to understand the produced code, but
could not make much sense of it. Some parts of the stack frame are
initialized, then copied around, and then other data is used in calling
the C functions. It seems like the optimizer just stopped arbitrarily,
presumably because of some heuristic cutoff. My main question is: is
there a switch to tune this heuristic?
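
As a sanity check on my reading of the attached dump, the offsets can be
worked through mechanically. The constants below are copied from the
listing; their names, and the helper src_of, are made up for
illustration. The checks show that the frame is 376 bytes, that the
"rep movsq" copies 144 bytes from one part of the frame to another, and
that the four pointers later loaded into %r12-%r15 are read back from
that copy rather than taken from the argument registers.

```cpp
#include <cstddef>

// Values read off the objdump listing of test_fmpzxx_asymadd_1 (g++-4.7.3).
constexpr std::size_t qword       = 8;
constexpr std::size_t frame_size  = 0x178;  // sub $0x178,%rsp
constexpr std::size_t copy_qwords = 0x12;   // mov $0x12,%ecx before rep movsq
constexpr std::size_t copy_src    = 0x10;   // lea 0x10(%rsp),%rsi
constexpr std::size_t copy_dst    = 0xa8;   // lea 0xa8(%rsp),%rdi
constexpr std::size_t copy_bytes  = copy_qwords * qword;

// Map an offset inside the copied block back to where it was copied from.
constexpr std::size_t src_of(std::size_t dst_offset)
{
    return dst_offset - copy_dst + copy_src;
}

static_assert(frame_size == 376, "frame size in bytes");
static_assert(copy_bytes == 144, "bytes moved by rep movsq");
static_assert(copy_src + copy_bytes == 0xa0,  "source block is [0x10, 0xa0)");
static_assert(copy_dst + copy_bytes == 0x138, "target block is [0xa8, 0x138)");

// The six callee-saved registers occupy the top 48 bytes of the frame.
static_assert(0x148 + 6 * qword == frame_size, "saved registers end the frame");

// Loads into %r12..%r15 come from the copy of the stored arguments:
static_assert(src_of(0xa8)  == 0x10, "%r12 <- copy of %rsi (a)");
static_assert(src_of(0xc8)  == 0x30, "%r15 <- copy of %rdx (b)");
static_assert(src_of(0x110) == 0x78, "%r14 <- copy of %rcx (c)");
static_assert(src_of(0x120) == 0x88, "%r13 <- copy of %r8 (d)");
```

So the arguments are stored into what looks like an un-eliminated
expression object, that object is copied wholesale, and the operands are
then reloaded from the copy.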
Please note that this problem is not specific to version 4.7.3. There
are other, similar examples where 4.7.3 optimizes just fine but, say,
4.8.1 produces similarly silly code.
Thanks,
Tom
[0] http://www.flintlib.org/
[1] It is a rather big library by now. I am trying to avoid showing the
relevant C++ code. In particular all my attempts at isolating a "minimal
problematic example" have caused the optimizer to kick in before the
code reached an acceptably small size.
You can find all the code at https://github.com/ness01/flint2/tree/gsoc,
the functions test_fmpzxx_asymadd_? discussed are found in
cxx/test/t-codegen.cpp.
0000000000402d80 <test_fmpzxx_asymadd_1>:
402d80: 48 89 5c 24 d0 mov %rbx,-0x30(%rsp)
402d85: 48 89 6c 24 d8 mov %rbp,-0x28(%rsp)
402d8a: 49 89 f1 mov %rsi,%r9
402d8d: 4c 89 64 24 e0 mov %r12,-0x20(%rsp)
402d92: 4c 89 6c 24 e8 mov %r13,-0x18(%rsp)
402d97: 48 89 fd mov %rdi,%rbp
402d9a: 4c 89 74 24 f0 mov %r14,-0x10(%rsp)
402d9f: 4c 89 7c 24 f8 mov %r15,-0x8(%rsp)
402da4: 48 81 ec 78 01 00 00 sub $0x178,%rsp
402dab: 4c 89 84 24 88 00 00 mov %r8,0x88(%rsp)
402db2: 00
402db3: 48 89 74 24 10 mov %rsi,0x10(%rsp)
402db8: 48 89 cb mov %rcx,%rbx
402dbb: 48 89 54 24 30 mov %rdx,0x30(%rsp)
402dc0: 48 89 4c 24 40 mov %rcx,0x40(%rsp)
402dc5: 48 8d bc 24 a8 00 00 lea 0xa8(%rsp),%rdi
402dcc: 00
402dcd: 48 89 74 24 50 mov %rsi,0x50(%rsp)
402dd2: 48 89 54 24 58 mov %rdx,0x58(%rsp)
402dd7: 48 8d 74 24 10 lea 0x10(%rsp),%rsi
402ddc: 48 89 4c 24 78 mov %rcx,0x78(%rsp)
402de1: b9 12 00 00 00 mov $0x12,%ecx
402de6: 48 c7 04 24 00 00 00 movq $0x0,(%rsp)
402ded: 00
402dee: f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi)
402df1: 4c 89 ce mov %r9,%rsi
402df4: 48 89 e7 mov %rsp,%rdi
402df7: 4c 8b bc 24 c8 00 00 mov 0xc8(%rsp),%r15
402dfe: 00
402dff: 4c 8b b4 24 10 01 00 mov 0x110(%rsp),%r14
402e06: 00
402e07: 4c 8b a4 24 a8 00 00 mov 0xa8(%rsp),%r12
402e0e: 00
402e0f: 4c 8b ac 24 20 01 00 mov 0x120(%rsp),%r13
402e16: 00
402e17: e8 24 ee ff ff callq 401c40 <fmpz_add@plt>
402e1c: 48 89 e2 mov %rsp,%rdx
402e1f: 48 89 de mov %rbx,%rsi
402e22: 48 89 e7 mov %rsp,%rdi
402e25: e8 16 ee ff ff callq 401c40 <fmpz_add@plt>
402e2a: 48 89 e2 mov %rsp,%rdx
402e2d: 4c 89 fe mov %r15,%rsi
402e30: 48 89 e7 mov %rsp,%rdi
402e33: e8 08 ee ff ff callq 401c40 <fmpz_add@plt>
402e38: 4c 89 f2 mov %r14,%rdx
402e3b: 48 89 e6 mov %rsp,%rsi
402e3e: 48 89 e7 mov %rsp,%rdi
402e41: e8 fa ed ff ff callq 401c40 <fmpz_add@plt>
402e46: 4c 89 ea mov %r13,%rdx
402e49: 48 89 e6 mov %rsp,%rsi
402e4c: 48 89 e7 mov %rsp,%rdi
402e4f: e8 ec ed ff ff callq 401c40 <fmpz_add@plt>
402e54: 48 89 ef mov %rbp,%rdi
402e57: 48 89 e2 mov %rsp,%rdx
402e5a: 4c 89 e6 mov %r12,%rsi
402e5d: e8 de ed ff ff callq 401c40 <fmpz_add@plt>
402e62: 48 8b 3c 24 mov (%rsp),%rdi
402e66: 48 89 f8 mov %rdi,%rax
402e69: 48 c1 f8 3e sar $0x3e,%rax
402e6d: 48 83 f8 01 cmp $0x1,%rax
402e71: 74 3d je 402eb0 <test_fmpzxx_asymadd_1+0x130>
402e73: 48 8b 9c 24 48 01 00 mov 0x148(%rsp),%rbx
402e7a: 00
402e7b: 48 8b ac 24 50 01 00 mov 0x150(%rsp),%rbp
402e82: 00
402e83: 4c 8b a4 24 58 01 00 mov 0x158(%rsp),%r12
402e8a: 00
402e8b: 4c 8b ac 24 60 01 00 mov 0x160(%rsp),%r13
402e92: 00
402e93: 4c 8b b4 24 68 01 00 mov 0x168(%rsp),%r14
402e9a: 00
402e9b: 4c 8b bc 24 70 01 00 mov 0x170(%rsp),%r15
402ea2: 00
402ea3: 48 81 c4 78 01 00 00 add $0x178,%rsp
402eaa: c3 retq
402eab: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
402eb0: e8 eb ed ff ff callq 401ca0 <_fmpz_clear_mpz@plt>
402eb5: eb bc jmp 402e73 <test_fmpzxx_asymadd_1+0xf3>
402eb7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
402ebe: 00 00