Re: problems with optimisation

David Brown <david@xxxxxxxxxxxxxxx> · Fri, 28 Dec 2012 17:33:59 +0100

On 28/12/12 16:19, Andrew Haley wrote:
> On 12/28/2012 10:25 AM, Kicer wrote:
>> Hi all
>>
>>
>> Last days I've found a problem with some certain code optimisations:
>>
>>
>> namespace 
>> {
>>   
>>   struct Base;
>>   
>>   struct Bit
>>   {
>> 	  const Base &m_p;
>> 	  const int m_pos;
>> 	  
>> 	  constexpr Bit(const Base &p, const int pos): m_p(p), m_pos(pos)
>> 	  {
>> 	  }
>> 	
>> 	  operator bool() const;  
>>   };
>>   
>>   struct Base
>>   {  
>> 	  const int m_port;
>> 	  constexpr Base(int p): m_port(p)
>> 	  {
>> 	  }
>> 	  
>> 	  operator char () const
>> 	  {
>> 		  char result;
>> 		    
>> 		  asm(
>> 			"in %%dx, %%al\n"
>> 			:"=a"(result)
>> 			:"d"(m_port)
>> 		  );
>> 		  
>> 		  //result = *(reinterpret_cast<char *>(m_port+32));
>> 		  
>> 		  return result;
>> 	  }
>> 	  
>> 	  Bit operator[] (int p) const
>> 	  {
>> 		  Bit r(*this, p);
>> 		  return r;
>> 	  }
>> 	
>>   };
>>
>>
>>   Bit::operator bool() const
>>   {
>> 	  const char v = m_p;
>> 	  const bool r = (v & (1 << m_pos)) > 0;
>> 	  
>> 	  return r;
>>   }
>>   
>>   struct Anc: public Base
>>   {
>> 	  const Base m_in;
>> 	  constexpr Anc(int o): Base(o), m_in(o - 1)
>> 	  {
>> 	  }
>> 	  
>> 	  const Base& getIn() const
>> 	  {
>> 		  return m_in;
>> 	  }
>> 	
>>   };
>>
>> }
>>
>> template<int v>
>> char foo()
>> {
>> 	Anc p(v), p2(v+2);
>> 	char r = p.getIn() + p2.getIn();
>> 	
>> 	//r += p[0]? 1: 0;                   //commented out at first step
>> 	r += p2[4]? 1 : 0;
>> 	
>> 	return r;
>> }
>>
>>
>> char bar()
>> {
>>   char r = foo<4>();
>>   
>>   r-= foo<6>();
>>   
>>   return r;
>> }
>>
>> there are 3 structs which looks more complex than the code they generate.
>> foo() and bar() are just ising those structs.
>> For the code above output is short and clear as expected: 
>>
>> but when I uncomment "//r += p[0]? 1: 0; " in foo(), the code becomes 
>> unexpectly large and unclear:
>>
> 
>>
>> compilation flags:
>> g++ -Os test.cpp -c -o test.o -std=c++11
>>
>>
>> this may seem to be a less important problem for x86 archs, but I'm affected 
>> with this problem on avr arch where memory is very limited. Can I somehow 
>> figure out why gcc resigns from generation clean code in second example?
> 
> With -O2 there's much less difference:
> 
> bar():								bar():
> .LFB14:								.LFB14:
> 	.cfi_startproc							.cfi_startproc
> 	movl	$3, %edx						movl	$3, %edx
> 	in %dx, %al							in %dx, %al
> 
> 	movb	$6, %dl					      |		movb	$4, %dl
> 	movl	%eax, %ecx						movl	%eax, %ecx
> 	in %dx, %al							in %dx, %al
> 
> 							      >		movb	$6, %dl
> 							      >		movl	%eax, %edi
> 							      >		in %dx, %al
> 							      >
> 	movb	$7, %dl							movb	$7, %dl
> 	movl	%eax, %esi						movl	%eax, %esi
> 							      >		andl	$1, %edi
> 	in %dx, %al							in %dx, %al
> 
> 	movl	%eax, %edi				      |		movl	%eax, %r8d
> 							      >		movsbl	%sil, %esi
> 	movb	$8, %dl							movb	$8, %dl
> 	subb	%dil, %cl				      |		subb	%r8b, %cl
> 	in %dx, %al							in %dx, %al
> 
> 	andl	$16, %esi				      |		addl	%edi, %ecx
> 							      >		testb	$16, %sil
> 	setne	%dl							setne	%dl
> 							      >		andl	$1, %esi
> 	addl	%edx, %ecx						addl	%edx, %ecx
> 							      >		subb	%sil, %cl
> 	testb	$16, %al						testb	$16, %al
> 	setne	%al							setne	%al
> 	subb	%al, %cl						subb	%al, %cl
> 	movl	%ecx, %eax						movl	%ecx, %eax
> 	ret								ret
> 
> 
> Without inlining GCC can't tell what your program is doing, and by using
> -Os you're preventing GCC from inlining.
> 
> Andrew.
> 

There are normally good reasons for picking -Os rather than -O2 for
small microcontrollers (the OP is targeting AVRs, which typically have
quite small program flash memories).

So the solution here is to manually declare the various functions as
"inline" (or at least "static", so that the compiler will inline them
automatically).  Very often, code that manipulates bits is horrible on a
target like the AVR if the function is not inline, and the compiler has
the bit number(s) as variables - but with inline code generation and
constant folding, you end up with only an instruction or two for
compile-time constant bit numbers.

(To the OP) - also note that there can be significant differences in the
types of code generation and optimisations for different backends.  I
assume you posted x86 assembly because you thought it would be more
familiar to people on this list, but I think it would be more important
to show the real assembly from the target you are using as you might see
different optimisations or missed optimisations.

Finally, there is a mailing list dedicated to gcc on the avr - it might
be worth posting there too, especially if you think the issue is
avr-specific.

David