> -----Original Message-----
> Behalf Of Pavel Pavlov
> Sent: Wednesday, June 16, 2010 12:40
> To: Andrew Haley
> Cc: gcc-help@xxxxxxxxxxx
> Subject: RE: Inline asm for ARM
>
> > -----Original Message-----
> > From: Andrew Haley [mailto:aph@xxxxxxxxxx]
> > On 06/16/2010 05:11 PM, Pavel Pavlov wrote:
> > >> -----Original Message-----
> > >> On 06/16/2010 01:15 PM, Andrew Haley wrote:
> > >>> On 06/16/2010 11:23 AM, Pavel Pavlov wrote:
> > > ...
> > >> inline uint64_t smlalbb(uint64_t acc, unsigned int lo, unsigned int hi)
> > >> {
> > >>   union
> > >>   {
> > >>     uint64_t ll;
> > >>     struct
> > >>     {
> > >>       unsigned int l;
> > >>       unsigned int h;
> > >>     } s;
> > >>   } retval;
> > >>
> > >>   retval.ll = acc;
> > >>
> > >>   __asm__("smlalbb %0, %1, %2, %3"
> > >>           : "+r"(retval.s.l), "+r"(retval.s.h)
> > >>           : "r"(lo), "r"(hi));
> > >>
> > >>   return retval.ll;
> > >> }
> > >>
> > >
> > > [Pavel Pavlov]
> > > Later on I found out that I had to use the "+r" constraint, but then,
> > > when I use that function, for example like this:
> > >
> > > int64_t rsmlalbb64(int64_t i, int x, int y)
> > > {
> > >   return smlalbb64(i, x, y);
> > > }
> > >
> > > gcc generates this asm:
> > >
> > > <rsmlalbb64>:
> > >   push    {r4, r5}
> > >   mov     r4, r0
> > >   mov     ip, r1
> > >   smlalbb r4, ip, r2, r3
> > >   mov     r5, ip
> > >   mov     r0, r4
> > >   mov     r1, ip
> > >   pop     {r4, r5}
> > >   bx      lr
> > >
> > > It's bizarre what gcc is doing in that function. I understand if it
> > > can't optimize and correctly use r0 and r1 directly, but from that
> > > listing it looks as if gcc got drunk and decided to touch r5 for
> > > absolutely no reason!
> > >
> > > The expected output should have been:
> > >
> > > <rsmlalbb64>:
> > >   smlalbb r0, r1, r2, r3
> > >   bx      lr
> > >
> > > I'm using cegcc 4.1.0 and I compile with
> > > arm-mingw32ce-g++ -O3 -mcpu=arm1136j-s -c ARM_TEST.cpp -o
> > > arm-mingw32ce-g++ ARM_TEST_GCC.obj
> > >
> > > Is there a way to access the individual parts of that 64-bit input
> > > integer, or is there a way to specify that two 32-bit integers should
> > > be treated as the hi:lo parts of a 64-bit variable? It's commonly done
> > > with a temporary, but the result is that gcc generates too much junk.
> >
> > Why don't you just use the function I sent above?  It generates
> >
> > smlalbb:
> >         smlalbb r0, r1, r2, r3
> >         mov     pc, lr
> >
> > smlalXX64:
> >         smlalbb r0, r1, r2, r3
> >         smlalbt r0, r1, r2, r3
> >         smlaltb r0, r1, r2, r3
> >         smlaltt r0, r1, r2, r3
> >         mov     pc, lr
> >
>
> [Pavel Pavlov]
> What's your gcc -v?  The output I posted comes from your function.
>
> By the way, the version that takes hi:lo parts for the first int64 works fine:
>
> static __inline void smlalbb(int *lo, int *hi, int x, int y)
> {
> #if defined(__CC_ARM)
>     __asm { smlalbb *lo, *hi, x, y; }
> #elif defined(__GNUC__)
>     __asm__ __volatile__("smlalbb %0, %1, %2, %3"
>                          : "+r"(*lo), "+r"(*hi)
>                          : "r"(x), "r"(y));
> #endif
> }
>
> void test_smlalXX(int hi, int lo, int a, int b)
> {
>     smlalbb(&hi, &lo, a, b);
>     smlalbt(&hi, &lo, a, b);
>     smlaltb(&hi, &lo, a, b);
>     smlaltt(&hi, &lo, a, b);
> }
>
> This translates directly into four asm opcodes.
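>
> (A minimal sketch, not tested here: if a 64-bit accumulator interface is
> still wanted, the pointer-based helper above can be wrapped the same way
> as the union version earlier in the thread; the name smlalbb64 and the
> little-endian low/high word order are assumptions.)
>
> static __inline int64_t smlalbb64(int64_t acc, int x, int y)
> {
>     /* Split the 64-bit accumulator into low/high words (little-endian ARM),
>        run the smlalbb() helper defined above on them, and reassemble.
>        Needs <stdint.h> for int64_t. */
>     union { int64_t ll; struct { int l; int h; } s; } r;
>     r.ll = acc;
>     smlalbb(&r.s.l, &r.s.h, x, y);
>     return r.ll;
> }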