RE: Inline asm for ARM

Pavel Pavlov <pavel@xxxxxxxxxxxxxx> · Wed, 16 Jun 2010 13:12:03 -0400

> -----Original Message-----
> From: gcc-help-owner@xxxxxxxxxxx [mailto:gcc-help-owner@xxxxxxxxxxx] On
> Behalf Of Andrew Haley
> Sent: Wednesday, June 16, 2010 12:58
> To: gcc-help@xxxxxxxxxxx
> Subject: Re: Inline asm for ARM
> 
> On 06/16/2010 05:54 PM, Pavel Pavlov wrote:
> >> -----Original Message-----
> >> Behalf Of Pavel Pavlov
> >> Sent: Wednesday, June 16, 2010 12:40
> >> To: Andrew Haley
> >> Cc: gcc-help@xxxxxxxxxxx
> >> Subject: RE: Inline asm for ARM
> >>
> >>> -----Original Message-----
> >>> From: Andrew Haley [mailto:aph@xxxxxxxxxx] On 06/16/2010 05:11 PM,
> >>> Pavel Pavlov wrote:
> >>>>> -----Original Message-----
> >>>>> On 06/16/2010 01:15 PM, Andrew Haley wrote:
> >>>>>> On 06/16/2010 11:23 AM, Pavel Pavlov wrote:
> >>>> ...
> >>>>> inline uint64_t smlalbb(uint64_t acc, unsigned int lo, unsigned int hi) {
> >>>>>   union
> >>>>>   {
> >>>>>     uint64_t ll;
> >>>>>     struct
> >>>>>     {
> >>>>>       unsigned int l;
> >>>>>       unsigned int h;
> >>>>>     } s;
> >>>>>   } retval;
> >>>>>
> >>>>>   retval.ll = acc;
> >>>>>
> >>>>>   __asm__("smlalbb %0, %1, %2, %3"
> >>>>> 	  : "+r"(retval.s.l), "+r"(retval.s.h)
> >>>>> 	  : "r"(lo), "r"(hi));
> >>>>>
> >>>>>   return retval.ll;
> >>>>> }
> >>>>>
> >>>>
> >>>> [Pavel Pavlov]
> >>>> Later on I found out that I had to use the +r constraint, but then,
> >>>> when I use that function, for example like this:
> >>>> int64_t rsmlalbb64(int64_t i, int x, int y) {
> >>>> 	return smlalbb64(i, x, y);
> >>>> }
> >>>>
> >>>> Gcc generates this asm:
> >>>> <rsmlalbb64>:
> >>>> push	{r4, r5}
> >>>> mov	r4, r0
> >>>> mov	ip, r1
> >>>> smlalbb	r4, ip, r2, r3
> >>>> mov	r5, ip
> >>>> mov	r0, r4
> >>>> mov	r1, ip
> >>>> pop	{r4, r5}
> >>>> bx	lr
> >>>>
> >>>> It's bizarre what gcc is doing in that function, I understand if it
> >>>> can't optimize and correctly use r0 and r1 directly, but from that
> >>>> listing it looks as if gcc got drunk and decided to touch r5 for
> >>>> absolutely no reason!
> >>>>
> >>>> The expected output should have been:
> >>>> <rsmlalbb64>:
> >>>> smlalbb	r0, r1, r2, r3
> >>>> bx	lr
> >>>>
> >>>> I'm using cegcc 4.1.0 and I compile with
> >>>> arm-mingw32ce-g++ -O3 -mcpu=arm1136j-s -c ARM_TEST.cpp -o ARM_TEST_GCC.obj
> >>>>
> >>>> Is there a way to access individual parts of that 64-bit input
> >>>> integer or, is there a way to specify that two 32-bit integers
> >>>> should be treated as a Hi:Lo parts of 64 bit variable. It's
> >>>> commonly done with a temporary, but the result is that gcc generates
> >>>> too much junk.
> >>>
> >>> Why don't you just use the function I sent above?  It generates
> >>>
> >>> smlalbb:
> >>> 	smlalbb r0, r1, r2, r3
> >>> 	mov	pc, lr
> >>>
> >>> smlalXX64:
> >>> 	smlalbb r0, r1, r2, r3
> >>> 	smlalbt r0, r1, r2, r3
> >>> 	smlaltb r0, r1, r2, r3
> >>> 	smlaltt r0, r1, r2, r3
> >>> 	mov	pc, lr
> >>>
> >>
> >> [Pavel Pavlov]
> >> What's your gcc -v? The output I posted comes from your function.
> >
> > By the way, the version that takes hi:lo for the first int64 works fine:
> >
> > static __inline void smlalbb(int * lo, int * hi, int x, int y) {
> > #if defined(__CC_ARM)
> > 	__asm { smlalbb *lo, *hi, x, y; }
> > #elif defined(__GNUC__)
> > 	__asm__ __volatile__("smlalbb %0, %1, %2, %3" : "+r"(*lo), "+r"(*hi)
> > 		: "r"(x), "r"(y));
> > #endif
> > }
> >
> >
> > void test_smlalXX(int hi, int lo, int a, int b) {
> > 	smlalbb(&hi, &lo, a, b);
> > 	smlalbt(&hi, &lo, a, b);
> > 	smlaltb(&hi, &lo, a, b);
> > 	smlaltt(&hi, &lo, a, b);
> > }
> >
> > Translates directly into four asm opcodes
> 
> Mmmm, but the volatile is wrong.  If you need volatile to stop gcc from deleting
> your asm, you have a mistake somewhere.
> 
> Andrew.

I had to add volatile back when I had that mess with "=&r" and "0", and I thought it could be removed now.
I just tested, and I still need it. The reason is that my test function was a no-op:
void test_smlalXX(int lo, int hi, int a, int b)
{
	smlalbb(&lo, &hi, a, b);
	smlalbt(&lo, &hi, a, b);
	smlaltb(&lo, &hi, a, b);
	smlaltt(&lo, &hi, a, b);
}
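Without volatile, gcc correctly works out that the function has no side effects: lo and hi are by-value parameters and the results are never used, so the asm statements are dead code and get deleted.
So I removed volatile and made the function return the result: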

uint64_t test_smlalXX(int lo, int hi, int a, int b)
{
	smlalbb(&lo, &hi, a, b);
	smlalbt(&lo, &hi, a, b);
	smlaltb(&lo, &hi, a, b);
	smlaltt(&lo, &hi, a, b);

	T64 retval;

	retval.s.hi = hi;
	retval.s.lo = lo;
	return retval.i64;
}
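
T64 is not defined anywhere in this message. Below is a minimal sketch of what it is assumed to be, modeled on the union in the smlalbb function quoted above. The member names lo, hi and i64 come from the code; the layout assumes a little-endian target. It is paired with a plain-C model of the four multiply-accumulates, so the returned value can be checked on any host. The names half16, ref_smlalxy and ref_test_smlalXX are made up for illustration:

```c
#include <stdint.h>

/* Assumed definition of T64: two 32-bit halves overlaid on a 64-bit value.
   With lo declared first, s.lo is the low word only on little-endian targets. */
typedef union {
    uint64_t i64;
    struct {
        uint32_t lo;
        uint32_t hi;
    } s;
} T64;

/* Sign-extend one 16-bit half of v: top != 0 selects bits 31..16,
   otherwise bits 15..0. */
static int32_t half16(int v, int top)
{
    int32_t h = (int32_t)(((uint32_t)v >> (top ? 16 : 0)) & 0xFFFFu);
    return (h & 0x8000) ? h - 0x10000 : h;
}

/* Hypothetical reference model of SMLAL<x><y>: acc += half(x) * half(y).
   The addition is done on unsigned values so it wraps like the instruction. */
static uint64_t ref_smlalxy(uint64_t acc, int x, int xtop, int y, int ytop)
{
    int32_t prod = half16(x, xtop) * half16(y, ytop);
    return acc + (uint64_t)(int64_t)prod;
}

/* Same sequence as test_smlalXX: bb, bt, tb, tt. */
static uint64_t ref_test_smlalXX(int lo, int hi, int a, int b)
{
    T64 t;
    t.s.lo = (uint32_t)lo;
    t.s.hi = (uint32_t)hi;
    t.i64 = ref_smlalxy(t.i64, a, 0, b, 0); /* smlalbb */
    t.i64 = ref_smlalxy(t.i64, a, 0, b, 1); /* smlalbt */
    t.i64 = ref_smlalxy(t.i64, a, 1, b, 0); /* smlaltb */
    t.i64 = ref_smlalxy(t.i64, a, 1, b, 1); /* smlaltt */
    return t.i64;
}
```

For example, with a = 0x00020003 and b = 0x00050007 the four products are 3*7, 3*5, 2*7 and 2*5, so starting from lo = hi = 0 the model returns 60.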

The output becomes:
000000e4 <_Z12test_smlalXXiiii>:
  e4:	e92d0030 	push	{r4, r5}
  e8:	e1410382 	smlalbb	r0, r1, r2, r3
  ec:	e14103c2 	smlalbt	r0, r1, r2, r3
  f0:	e14103a2 	smlaltb	r0, r1, r2, r3
  f4:	e1a05001 	mov	r5, r1
  f8:	e14503e2 	smlaltt	r0, r5, r2, r3
  fc:	e1a04000 	mov	r4, r0
 100:	e1a01005 	mov	r1, r5
 104:	e8bd0030 	pop	{r4, r5}
 108:	e12fff1e 	bx	lr

Basically, gcc gets confused about the return variable and generates useless gunk around the last asm statement. I tried commenting out smlaltt(&lo, &hi, a, b); in test_smlalXX, and gcc still generates the same useless code, this time around smlaltb.
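
One possible way around the hi/lo temporaries is to pass the accumulator as a single 64-bit operand and let the ARM-specific operand modifiers %Q (low register) and %R (high register) name its two halves. This is a sketch only and has not been checked against cegcc 4.1.0. The name smlalbb64 is borrowed from earlier in the thread. The asm path assumes a 32-bit ARM target that has the SMLALxy instructions (for example -mcpu=arm1136j-s), and the non-ARM branch is a plain-C fallback added here purely so the function can be tried on other hosts:

```c
#include <stdint.h>

static inline int64_t smlalbb64(int64_t acc, int x, int y)
{
#if defined(__GNUC__) && defined(__arm__)
    /* Assumption: the target supports SMLALBB (ARMv5TE or later).
       %Q0 / %R0 print the low / high register of 64-bit operand 0. */
    __asm__("smlalbb %Q0, %R0, %1, %2"
            : "+r"(acc)
            : "r"(x), "r"(y));
    return acc;
#else
    /* Fallback for other hosts: multiply the sign-extended bottom
       halfwords and accumulate with wrap-around. */
    int32_t xb = (int32_t)((uint32_t)x & 0xFFFFu);
    int32_t yb = (int32_t)((uint32_t)y & 0xFFFFu);
    if (xb & 0x8000) xb -= 0x10000;
    if (yb & 0x8000) yb -= 0x10000;
    return (int64_t)((uint64_t)acc + (uint64_t)(int64_t)(xb * yb));
#endif
}
```

Whether this produces a bare smlalbb r0, r1, r2, r3 depends on how the compiler allocates the 64-bit register pair, so the generated code would still need to be inspected.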