Re: Inline asm for ARM

Andrew Haley <aph@xxxxxxxxxx> · Wed, 16 Jun 2010 17:59:40 +0100

On 06/16/2010 05:57 PM, Pavel Pavlov wrote:
> 
> 
>> -----Original Message-----
>> From: gcc-help-owner@xxxxxxxxxxx [mailto:gcc-help-owner@xxxxxxxxxxx] On
>> Behalf Of Andrew Haley
>> Sent: Wednesday, June 16, 2010 12:52
>> To: gcc-help@xxxxxxxxxxx
>> Subject: Re: Inline asm for ARM
>>
>> On 06/16/2010 05:40 PM, Pavel Pavlov wrote:
>>>> -----Original Message-----
>>>> From: Andrew Haley [mailto:aph@xxxxxxxxxx] On 06/16/2010 05:11 PM,
>>>> Pavel Pavlov wrote:
>>>>>> -----Original Message-----
>>>>>> On 06/16/2010 01:15 PM, Andrew Haley wrote:
>>>>>>> On 06/16/2010 11:23 AM, Pavel Pavlov wrote:
>>>>> ...
>>>>>> inline uint64_t smlalbb(uint64_t acc, unsigned int lo, unsigned int hi) {
>>>>>>   union
>>>>>>   {
>>>>>>     uint64_t ll;
>>>>>>     struct
>>>>>>     {
>>>>>>       unsigned int l;
>>>>>>       unsigned int h;
>>>>>>     } s;
>>>>>>   } retval;
>>>>>>
>>>>>>   retval.ll = acc;
>>>>>>
>>>>>>   __asm__("smlalbb %0, %1, %2, %3"
>>>>>> 	  : "+r"(retval.s.l), "+r"(retval.s.h)
>>>>>> 	  : "r"(lo), "r"(hi));
>>>>>>
>>>>>>   return retval.ll;
>>>>>> }
>>>>>>
>>>>>
>>>>> [Pavel Pavlov]
>>>>> Later on I found out that I had to use the +r constraint, but then,
>>>>> when I use that function, for example like this:
>>>>> int64_t rsmlalbb64(int64_t i, int x, int y) {
>>>>> 	return smlalbb64(i, x, y);
>>>>> }
>>>>>
>>>>> Gcc generates this asm:
>>>>> <rsmlalbb64>:
>>>>> push	{r4, r5}
>>>>> mov	r4, r0
>>>>> mov	ip, r1
>>>>> smlalbb	r4, ip, r2, r3
>>>>> mov	r5, ip
>>>>> mov	r0, r4
>>>>> mov	r1, ip
>>>>> pop	{r4, r5}
>>>>> bx	lr
>>>>>
>>>>> It's bizarre what gcc is doing in that function, I understand if it
>>>>> can't optimize and correctly use r0 and r1 directly, but from that
>>>>> listing it looks as if gcc got drunk and decided to touch r5 for
>>>>> absolutely no reason!
>>>>>
>>>>> The expected output should have been:
>>>>> <rsmlalbb64>:
>>>>> smlalbb	r0, r1, r2, r3
>>>>> bx	lr
>>>>>
>>>>> I'm using cegcc 4.1.0 and I compile with
>>>>> arm-mingw32ce-g++ -O3 -mcpu=arm1136j-s -c ARM_TEST.cpp -o ARM_TEST_GCC.obj
>>>>>
>>>>> Is there a way to access the individual parts of that 64-bit input
>>>>> integer, or is there a way to specify that two 32-bit integers
>>>>> should be treated as the Hi:Lo parts of a 64-bit variable? It's commonly
>>>>> done with a temporary, but the result is that gcc generates too much junk.
>>>>
>>>> Why don't you just use the function I sent above?  It generates
>>>>
>>>> smlalbb:
>>>> 	smlalbb r0, r1, r2, r3
>>>> 	mov	pc, lr
>>>>
>>>> smlalXX64:
>>>> 	smlalbb r0, r1, r2, r3
>>>> 	smlalbt r0, r1, r2, r3
>>>> 	smlaltb r0, r1, r2, r3
>>>> 	smlaltt r0, r1, r2, r3
>>>> 	mov	pc, lr
>>>>
>>>
>>> [Pavel Pavlov]
>>> What's your gcc -v? The output I posted comes from your function.
>>
>> 4.3.0
>>
>> Perhaps your compiler options were wrong?  Dunno.
>>
> 
> 
>  [Pavel Pavlov] 
> It's kind of difficult to get that part wrong :)

It's not.  Trust me, I have been on gcc-help for a _long_ while...

I've even seen complaints about poor code when optimization is disabled.

Andrew.

 I saw that there are some changes between 4.1.0 and 4.3.0 in the ARM code; the optimizer might have been improved between the two versions as well. So I'm building 4.4.0 now to see if it fixes the problem.
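
For comparing compiler versions, it may help to have a plain-C model of what the inline asm is supposed to compute, plus a shift-based way of getting at the Hi:Lo halves without a union. The following is only a sketch: the names lo32, hi32, join64, sext16 and smlalbb_ref are made up for illustration and do not come from this thread or from gcc, and the model assumes the usual SMLALBB semantics (64-bit accumulator plus the signed product of the two bottom halfwords).

```c
#include <stdint.h>

/* Split a 64-bit value into its 32-bit halves using shifts instead of a union. */
static inline uint32_t lo32(uint64_t v) { return (uint32_t)v; }
static inline uint32_t hi32(uint64_t v) { return (uint32_t)(v >> 32); }

/* Rebuild a 64-bit value from its Hi:Lo halves. */
static inline uint64_t join64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Sign-extend the bottom 16 bits of v without relying on
   implementation-defined narrowing conversions. */
static inline int32_t sext16(uint32_t v)
{
    int32_t s = (int32_t)(v & 0xFFFFu);
    return s >= 0x8000 ? s - 0x10000 : s;
}

/* Hypothetical reference model (not the asm version from the thread):
   acc + signed(x[15:0]) * signed(y[15:0]), wrapping modulo 2^64. */
static inline uint64_t smlalbb_ref(uint64_t acc, uint32_t x, uint32_t y)
{
    int32_t product = sext16(x) * sext16(y);  /* magnitude at most 2^30, fits in int32_t */
    return acc + (uint64_t)(int64_t)product;  /* unsigned add gives well-defined wraparound */
}
```

Running the same inputs through smlalbb_ref and the asm-based function on the target is one way to check that a given compiler build still produces correct results while the generated code is being compared. Whether gcc keeps the halves in r0/r1 for the shift-based version likely depends on the compiler version and options, just as with the union.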