On 03/26/2014 06:48 PM, Ian Lance Taylor wrote:
My point was that the out buffer's cur ptr gets loaded/stored all the time,
even stored more than once in succession on certain paths. Yes,
encode_noinline() could, and actually will, modify the cur ptr. But that
call is on a path marked unlikely, while the likely path doesn't contain any
calls, so could work entirely with registers. The loading/storing of cur on
the likely path is a pessimization that affects performance.
I hope this clarifies it. Is it then an optimizer issue?
I see what you mean. You want the compiler to pull the value out of
memory for the likely loop and then store it back into memory for the
unlikely case. That seems possible. My first thought is that that
would be a moderately costly optimization that would very rarely pay
off, but I could be wrong.
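The transformation described here can be sketched by hand. This is only an illustration under assumptions: the OutBuf and Node layouts, the threshold 0xff, the use of __builtin_expect to mark the unlikely branch, and the names encode_slow() and encode_node_list_manual() are all hypothetical stand-ins, not the original test case.

```cpp
// Hypothetical types; the original definitions are not shown in this thread.
struct OutBuf { int* cur; };
struct Node { int data; Node* next; };

// Stand-in for the out-of-line unlikely path. It advances out.cur itself,
// so the caller cannot keep a stale copy of the pointer across the call.
__attribute__((noinline))
void encode_slow(OutBuf& out, int v) {
    *out.cur++ = 0;
    *out.cur++ = v;
}

void encode_node_list_manual(OutBuf& out, Node* n) {
    int* cur = out.cur;                     // pull the pointer out of memory once
    for (; n; n = n->next) {
        int v = n->data;
        if (__builtin_expect(v > 0xff, 0)) {
            out.cur = cur;                  // store back only on the unlikely path
            encode_slow(out, v);
            cur = out.cur;                  // pick up whatever the callee changed
        } else {
            *cur++ = v;                     // likely path: no access to out.cur
        }
    }
    out.cur = cur;                          // final store after the loop
}
```

Written this way, the likely path never touches out.cur, which is the code one would hope the optimizer could derive on its own.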
Thanks, that might be part of it, but it seems that something else is at
play here. To test the theory that the function call sabotages moving
the cur ptr to a register, I commented out the noinline attribute at
line 13, before encode_noinline(). Now there are no function calls left,
but I'm really puzzled:
Dump of assembler code for function encode_node_list(OutBuf&, Node*):
0x0000000000400600 <+0>: test %rsi,%rsi
0x0000000000400603 <+3>: je 0x40062f <encode_node_list(OutBuf&, Node*)+47>
// load outbuf's cur ptr
0x0000000000400605 <+5>: mov (%rdi),%rax
0x0000000000400608 <+8>: jmp 0x400613 <encode_node_list(OutBuf&, Node*)+19>
0x000000000040060a <+10>: nopw 0x0(%rax,%rax,1)
0x0000000000400610 <+16>: mov %rcx,%rax
// load the data
0x0000000000400613 <+19>: mov (%rsi),%edx
// calc next cur
0x0000000000400615 <+21>: lea 0x4(%rax),%rcx
// store!
0x0000000000400619 <+25>: mov %rcx,(%rdi)
0x000000000040061c <+28>: cmp $0xff,%edx
0x0000000000400622 <+34>: jg 0x400631 <encode_node_list(OutBuf&, Node*)+49>
0x0000000000400624 <+36>: mov %edx,(%rax)
0x0000000000400626 <+38>: mov 0x8(%rsi),%rsi
0x000000000040062a <+42>: test %rsi,%rsi
0x000000000040062d <+45>: jne 0x400610 <encode_node_list(OutBuf&, Node*)+16>
0x000000000040062f <+47>: repz retq
0x0000000000400631 <+49>: lea 0x8(%rax),%rcx
0x0000000000400635 <+53>: cmp $0xffff,%edx
0x000000000040063b <+59>: movl $0x0,(%rax)
// store again!
0x0000000000400641 <+65>: mov %rcx,(%rdi)
0x0000000000400644 <+68>: jg 0x40064b <encode_node_list(OutBuf&, Node*)+75>
0x0000000000400646 <+70>: mov %edx,0x4(%rax)
0x0000000000400649 <+73>: jmp 0x400626 <encode_node_list(OutBuf&, Node*)+38>
// from here: the code of encode_noinline()
0x000000000040064b <+75>: cmp $0xffffff,%edx
0x0000000000400651 <+81>: movl $0x0,0x4(%rax)
0x0000000000400658 <+88>: jg 0x400666 <encode_node_list(OutBuf&, Node*)+102>
0x000000000040065a <+90>: lea 0xc(%rax),%rcx
// and store again!
0x000000000040065e <+94>: mov %rcx,(%rdi)
0x0000000000400661 <+97>: mov %edx,0x8(%rax)
0x0000000000400664 <+100>: jmp 0x400626 <encode_node_list(OutBuf&, Node*)+38>
0x0000000000400666 <+102>: lea 0x10(%rax),%rcx
0x000000000040066a <+106>: movl $0x0,0x8(%rax)
// and again!
0x0000000000400671 <+113>: mov %rcx,(%rdi)
0x0000000000400674 <+116>: mov %edx,0xc(%rax)
0x0000000000400677 <+119>: jmp 0x400626 <encode_node_list(OutBuf&, Node*)+38>
I naively thought that if everything is inlined, and for code this simple,
the ptr would be kept in a register the whole time: loaded once at the
beginning, stored once at the end. What is going on?
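For comparison, the "load once, store once" shape can be written out by hand. The sketch below is an assumption-laden reconstruction: the struct layouts and the encoding (zero, one, two or three zero words before the value) are inferred from the offsets and constants in the dump above, and encode() and encode_node_list_local() are hypothetical names, not the original source.

```cpp
// Layouts inferred from the dump: cur at offset 0 of OutBuf,
// data at offset 0 and next at offset 8 of Node. Assumed, not original.
struct OutBuf { int* cur; };
struct Node { int data; Node* next; };

// Writes up to three zero words followed by v and returns the advanced
// pointer. It works purely on a local pointer, never on OutBuf.
static inline int* encode(int* cur, int v) {
    if (v > 0xff) {
        *cur++ = 0;
        if (v > 0xffff) {
            *cur++ = 0;
            if (v > 0xffffff) {
                *cur++ = 0;
            }
        }
    }
    *cur++ = v;
    return cur;
}

void encode_node_list_local(OutBuf& out, Node* n) {
    int* cur = out.cur;              // loaded once at the beginning
    for (; n; n = n->next)
        cur = encode(cur, n->data);
    out.cur = cur;                   // stored once at the end
}
```

Because cur is a local whose address is never taken, nothing in the loop requires it to live in memory, so this version does not depend on the optimizer removing the intermediate stores.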
I thought about aliasing rules, too. I deliberately chose int* instead
of char*, because in the latter case the rules say that writing through a
char* invalidates everything. But for an int*, writing an int to
memory can't invalidate the pointer itself, because they are different
types. The strict aliasing rules say, if I'm not mistaken, that a write
through a pointer invalidates all values that were read through pointers
to the same type and are live in registers, so they must be reloaded,
unless the pointer read or written through is declared __restrict,
which means it doesn't alias other pointers (of the same type).
Am I right?
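A small pair of functions makes the int* versus char* difference concrete. This is a sketch only: IntBuf, CharBuf and put_twice() are hypothetical names, and the comments describe what the aliasing rules permit a compiler to assume, not what any particular GCC version is guaranteed to emit.

```cpp
struct IntBuf  { int*  cur; };
struct CharBuf { char* cur; };

void put_twice(IntBuf& b, int v) {
    // A store through an int lvalue cannot legally modify b.cur, which has
    // type int*. With strict aliasing the compiler may keep b.cur in a
    // register between the two writes.
    *b.cur++ = v;
    *b.cur++ = v;
}

void put_twice(CharBuf& b, char v) {
    // A store through a char lvalue may alias any object, including the
    // bytes of b.cur itself, so b.cur has to be reloaded after each write.
    *b.cur++ = v;
    *b.cur++ = v;
}
```

Both overloads behave identically at the source level; any difference shows up only in the generated code, for example when comparing the output of -O2 -S for the two functions.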
If I compile w/ -fno-strict-aliasing, then the cur ptr is reloaded
after each write of 0, as expected. Interestingly, the
code is shorter than above by 17 bytes.
So with strict aliasing, the unnecessary loads are eliminated, but why
are there unnecessary stores?
Thanks, Peter