RE: Inline asm for ARM

Pavel Pavlov <pavel@xxxxxxxxxxxxxx> · Thu, 17 Jun 2010 01:53:23 -0400

> -----Original Message-----
> From: Andrew Haley [mailto:aph@xxxxxxxxxx]
> Sent: Wednesday, June 16, 2010 13:23
> To: Pavel Pavlov
> Cc: gcc-help@xxxxxxxxxxx
> Subject: Re: Inline asm for ARM
> 
> On 06/16/2010 06:12 PM, Pavel Pavlov wrote:
> >> From: gcc-help-owner@xxxxxxxxxxx [mailto:gcc-help-owner@xxxxxxxxxxx]
> >> On Behalf Of Andrew Haley
> >>
> >>> By the way, the version that takes hi:lo for the first int64 works fine:
> >>>
> >>> static __inline void smlalbb(int * lo, int * hi, int x, int y)
> >>> {
> >>> #if defined(__CC_ARM)
> >>> 	__asm { smlalbb *lo, *hi, x, y; }
> >>> #elif defined(__GNUC__)
> >>> 	__asm__ __volatile__("smlalbb %0, %1, %2, %3"
> >>> 		: "+r"(*lo), "+r"(*hi)
> >>> 		: "r"(x), "r"(y));
> >>> #endif
> >>> }
> >>>
> >>>
> >>> void test_smlalXX(int hi, int lo, int a, int b) {
> >>> 	smlalbb(&hi, &lo, a, b);
> >>> 	smlalbt(&hi, &lo, a, b);
> >>> 	smlaltb(&hi, &lo, a, b);
> >>> 	smlaltt(&hi, &lo, a, b);
> >>> }
> >>>
> >>> Translates directly into four asm opcodes
> >>
> >> Mmmm, but the volatile is wrong.  If you need volatile to stop gcc
> >> from deleting your asm, you have a mistake somewhere.
> >
> > I had to add volatile when I had that mess with "=&r" and "0", now I
> > think it might be removed.
> 
> > Just tested, and I still need that. The reason I needed that was
> > because my test function was a noop:
> 
> > void test_smlalXX(int lo, int hi, int a, int b) {
> > 	smlalbb(&lo, &hi, a, b);
> > 	smlalbt(&lo, &hi, a, b);
> > 	smlaltb(&lo, &hi, a, b);
> > 	smlaltt(&lo, &hi, a, b);
> > }
> 
> > Gcc correctly guesses that there is no side effect from that function
> > if I don't use volatile.  So, I removed volatile and added return for
> > that function:
> >
> > uint64_t test_smlalXX(int lo, int hi, int a, int b) {
> > 	smlalbb(&lo, &hi, a, b);
> > 	smlalbt(&lo, &hi, a, b);
> > 	smlaltb(&lo, &hi, a, b);
> > 	smlaltt(&lo, &hi, a, b);
> >
> > 	T64 retval;
> >
> > 	retval.s.hi = hi;
> > 	retval.s.lo = lo;
> > 	return retval.i64;
> > }
> >
> > The output becomes:
> > 000000e4 <_Z12test_smlalXXiiii>:
> >   e4:	e92d0030 	push	{r4, r5}
> >   e8:	e1410382 	smlalbb	r0, r1, r2, r3
> >   ec:	e14103c2 	smlalbt	r0, r1, r2, r3
> >   f0:	e14103a2 	smlaltb	r0, r1, r2, r3
> >   f4:	e1a05001 	mov	r5, r1
> >   f8:	e14503e2 	smlaltt	r0, r5, r2, r3
> >   fc:	e1a04000 	mov	r4, r0
> >  100:	e1a01005 	mov	r1, r5
> >  104:	e8bd0030 	pop	{r4, r5}
> >  108:	e12fff1e 	bx	lr
> >
> > Basically gcc gets confused about the return variable and generates
> > useless gunk at the end for the last function. I tried commenting out
> > smlaltt(&lo, &hi, a, b); in test_smlalXX, and gcc still generates
> > that same useless code around smlaltb.
> 
> I have seen something similar with higher optimization levels, where some pass
> messes things up a bit.  Your
> 
>  	mov	r4, r0
> 
> is very weird, though.  I can't explain that.
> 
> -O1 generates perfect code for me, though.
> 
> Andrew.

[Pavel Pavlov] 
Final update: I recompiled with cegcc 4.4.0, and the issues appear to be resolved in that version: it now generates perfect assembly. I'm working on code that should compile with gcc, armcc, and Microsoft's compiler for ARM (WinCE).

For example, I have functions like these:
static __inline uint64_t smlalbb64(uint64_t i, int x, int y)
{
#if defined(_MSC_VER)
	return _SmulAddLo_SW_SQ(i, x, y);
#elif defined(__CC_ARM)
	U64 xx; xx.ll = i;	/* load the 64-bit accumulator passed in as i */
	__asm{ smlalbb xx.s.lo, xx.s.hi, x, y };
	return xx.ll;
#elif defined(__GNUC__)
	U64 xx; xx.ll = i;	/* load the 64-bit accumulator passed in as i */
	__asm__ ("smlalbb %0, %1, %2, %3" : "+r"(xx.s.lo), "+r"(xx.s.hi) : "r"(x), "r"(y));
	return xx.ll;
#else 
#error 123
#endif
}
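
The U64 type used above is not shown in this message. As a minimal sketch only, it could look something like the following. The field names are taken from the code above, but the field types, the little-endian layout, and the u64_from_halves helper are assumptions made for illustration:

```c
#include <stdint.h>

/* Hypothetical definition: the real U64 is not shown in this thread.
   Assumes a little-endian target, so s.lo overlays the low 32 bits of ll. */
typedef union {
	uint64_t ll;
	struct {
		uint32_t lo;
		uint32_t hi;
	} s;
} U64;

/* Hypothetical helper: build a 64-bit accumulator from its two halves. */
static inline uint64_t u64_from_halves(uint32_t lo, uint32_t hi)
{
	U64 xx;
	xx.s.lo = lo;
	xx.s.hi = hi;
	return xx.ll;
}
```

On a big-endian target the two struct members would have to be swapped for s.lo to name the low word.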

Then this test function:
uint64_t test_smlalXX64(int lo, int hi, int a, int b)
{
	U64 xx;
	xx.s.lo = lo;
	xx.s.hi = hi;

	xx.ll = smlalbb64(xx.ll, a, b);
	xx.ll = smlalbt64(xx.ll, a, b);
	xx.ll = smlaltb64(xx.ll, a, b);
	xx.ll = smlaltt64(xx.ll, a, b);

	return xx.ll;
}

With gcc 4.4.0 and Microsoft's ARM compiler (which is about 5 years old!), this compiles to four ARM instructions (plus bx lr); armcc still generates a few useless instructions. For smlalbb (the same as smlalbb64, but taking lo and hi pointers directly), all three compilers produce the expected four ARM instructions.
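
For checking the results of the test function on a host machine, a plain-C reference model can be handy. The sketch below is my reading of the SMLAL<x><y> semantics (signed 16x16 multiply of the selected halfwords, accumulated into 64 bits). The names bottom16, top16, smlal_ref and test_smlalXX64_ref are made up for this example, and it relies on gcc's wrapping behavior when converting to int16_t:

```c
#include <stdint.h>

/* Sign-extended bottom halfword of v (hypothetical helper). */
static inline int32_t bottom16(int v)
{
	return (int16_t)(v & 0xFFFF);
}

/* Sign-extended top halfword of v (hypothetical helper). */
static inline int32_t top16(int v)
{
	return (int16_t)(((uint32_t)v >> 16) & 0xFFFF);
}

/* acc += a * b, where a and b are already sign-extended 16-bit values,
   so the product always fits in 32 bits. */
static inline uint64_t smlal_ref(uint64_t acc, int32_t a, int32_t b)
{
	return acc + (uint64_t)(int64_t)(a * b);
}

/* Plain-C model of the four-instruction test function. */
uint64_t test_smlalXX64_ref(int lo, int hi, int a, int b)
{
	uint64_t acc = ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;

	acc = smlal_ref(acc, bottom16(a), bottom16(b));	/* smlalbb */
	acc = smlal_ref(acc, bottom16(a), top16(b));	/* smlalbt */
	acc = smlal_ref(acc, top16(a), bottom16(b));	/* smlaltb */
	acc = smlal_ref(acc, top16(a), top16(b));	/* smlaltt */

	return acc;
}
```

Comparing this against the inline-asm version on the target for a handful of inputs should catch swapped lo/hi operands or a wrong halfword selection.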

Thanks for the help.