Inline asm for ARM

Pavel Pavlov <pavel@xxxxxxxxxxxxxx> · Wed, 16 Jun 2010 06:23:49 -0400

I have spent hours trying to get this working, but I can't find a way to do it properly.
In ARMv5TE there is an instruction, SMLALBB: http://bit.ly/amvRVv
SMLALBB RdLo, RdHi, Rm, Rs
It multiplies the bottom 16 bits of Rm by the bottom 16 bits of Rs and adds the 32-bit result to the 64-bit integer held in the register pair RdLo, RdHi.
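To pin down the semantics described above, here is a minimal portable C model of what the instruction computes. The names low16_signed and smlalbb_ref are hypothetical (they are not ARM or GCC identifiers), and the model treats the RdLo/RdHi pair as a single int64_t. It is only a sketch for checking results on a host machine, not a replacement for the instruction.

```c
#include <stdint.h>

/* Sign-extend the bottom 16 bits of v without relying on narrowing casts. */
static int32_t low16_signed(int32_t v)
{
    return ((v & 0xFFFF) ^ 0x8000) - 0x8000;
}

/* Hypothetical reference model: acc + (low 16 of x) * (low 16 of y),
 * wrapping at 64 bits. */
static int64_t smlalbb_ref(int64_t acc, int32_t x, int32_t y)
{
    /* Two signed 16-bit values multiply to at most 2^30, so this fits. */
    int32_t prod = low16_signed(x) * low16_signed(y);

    /* Add as unsigned so overflow wraps instead of being undefined.
     * Converting back to int64_t assumes two's complement, as on GCC. */
    return (int64_t)((uint64_t)acc + (uint64_t)(int64_t)prod);
}
```

The top halves of x and y are ignored, so smlalbb_ref(0, 0x00050002, 3) only multiplies 2 by 3.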
I have tried everything I can think of and still can't get it working.

The closest try was:
static __inline void smlalbb(int * lo, int * hi, int x, int y)
{
	__asm__ __volatile__("smlalbb %0, %1, %2, %3" : "=&r"(lo), "=&r"(hi) : "r"(x), "r"(y), "0"(lo), "1"(hi));
}

It seemed to produce the correct result, but only in a simple test function; when I chained calls to this smlalbb function, the results were no longer correct.

The correct way would probably be to use (*lo) and (*hi) in the operand lists, but in that case it adds too many useless loads and stores (instead of translating directly to a single asm instruction, it generates something like 8-10 instructions).
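One way to put (*lo) and (*hi) in the operand lists is GCC's "+r" read-write constraint, sketched below. The assumptions are mine, not from any confirmed fix: the name smlalbb_rw is hypothetical, the asm path is compiled only when __arm__ is defined (and assumes the target supports the ARMv5TE DSP instructions), and every other target takes a portable fallback so the sketch can be run on a host.

```c
#include <stdint.h>

/* Hypothetical variant: *lo and *hi are read-write operands ("+r"), so the
 * compiler knows the asm both consumes and produces their values. */
static inline void smlalbb_rw(int *lo, int *hi, int x, int y)
{
#if defined(__arm__)
    __asm__("smlalbb %0, %1, %2, %3"
            : "+r"(*lo), "+r"(*hi)
            : "r"(x), "r"(y));
#else
    /* Portable fallback (assumption: used only for host-side testing). */
    uint64_t acc = ((uint64_t)(uint32_t)*hi << 32) | (uint32_t)*lo;
    int32_t a = ((x & 0xFFFF) ^ 0x8000) - 0x8000;  /* sign-extend low 16 */
    int32_t b = ((y & 0xFFFF) ^ 0x8000) - 0x8000;
    acc += (uint64_t)(int64_t)(a * b);             /* wraps at 64 bits */
    *lo = (int)(uint32_t)acc;
    *hi = (int)(uint32_t)(acc >> 32);
#endif
}
```

An out-of-line copy of such a function still has to load and store through the pointers. The loads and stores can only disappear when the call is inlined into a caller where the two halves are local variables.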

In armcc (ARM's compiler) that function looks simply like this:
__asm { smlalbb *lo, *hi, x, y; }

Besides that plain function, which should really translate directly to a single asm instruction (when all the parameters passed are already in registers), I also wanted to write an inline function that takes an int64_t as its first parameter instead of a pair of registers. Under the hood, the int64_t behind an int64_t & or int64_t * is held in a pair of registers, but when I tried to write a function that does that, it generated too much housekeeping junk: it loads the halves of the int64_t variable into other temporary registers and, once the result is obtained, copies the temporary registers back into the original registers instead of operating directly on the original registers. (Does what I'm trying to say make sense?)
As an example, here is the code that gets generated for the smlalbb function:
000000c0 <smlalbb>:
  c0:	e92d4030 	push	{r4, r5, lr}
  c4:	e591c000 	ldr	ip, [r1]
  c8:	e590e000 	ldr	lr, [r0]
  cc:	e1a0500c 	mov	r5, ip
  d0:	e1a0400e 	mov	r4, lr
  d4:	e1454382 	smlalbb	r4, r5, r2, r3
  d8:	e5804000 	str	r4, [r0]
  dc:	e5815000 	str	r5, [r1]
  e0:	e8bd8030 	pop	{r4, r5, pc}

Any idea if that can be done at all with inline asm or not?
For the int64_t version, here's the best result that I got:
000000d4 <smlalXX64>:
  d4:	e52de004 	push	{lr}		; (str lr, [sp, #-4]!)
  d8:	e24dd008 	sub	sp, sp, #8
  dc:	e88d0003 	stm	sp, {r0, r1}
  e0:	e89d0003 	ldm	sp, {r0, r1}
  e4:	e1a0e000 	mov	lr, r0
  e8:	e1a0c001 	mov	ip, r1
  ec:	e14ce382 	smlalbb	lr, ip, r2, r3
  f0:	e58de000 	str	lr, [sp]
  f4:	e58dc004 	str	ip, [sp, #4]
  f8:	e1a0000e 	mov	r0, lr
  fc:	e1a0100c 	mov	r1, ip
 100:	e14103c2 	smlalbt	r0, r1, r2, r3
 104:	e88d0003 	stm	sp, {r0, r1}
 108:	e1a0e000 	mov	lr, r0
 10c:	e1a0c001 	mov	ip, r1
 110:	e14ce3a2 	smlaltb	lr, ip, r2, r3
 114:	e58de000 	str	lr, [sp]
 118:	e58dc004 	str	ip, [sp, #4]
 11c:	e14ce3e2 	smlaltt	lr, ip, r2, r3
 120:	e28dd008 	add	sp, sp, #8
 124:	e8bd8000 	pop	{pc}

And the C function was:
void smlalXX64 (int64_t i, int a, int b)
{
	smlalbb64(&i, a, b);
	smlalbt64(&i, a, b);
	smlaltb64(&i, a, b);
	smlaltt64(&i, a, b);
}
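For reference, an int64_t wrapper of the kind called above could be sketched by splitting the accumulator into two 32-bit locals and passing them as "+r" operands. This is not the code that produced the listing: smlalbb64_sketch is a hypothetical name, the asm path assumes an ARMv5TE-capable target with __arm__ defined, and the fallback branch exists only so the sketch runs on other hosts.

```c
#include <stdint.h>

/* Hypothetical int64_t wrapper: split, accumulate, recombine. */
static inline void smlalbb64_sketch(int64_t *acc, int x, int y)
{
    uint32_t lo = (uint32_t)*acc;                    /* low word  */
    uint32_t hi = (uint32_t)((uint64_t)*acc >> 32);  /* high word */
#if defined(__arm__)
    __asm__("smlalbb %0, %1, %2, %3"
            : "+r"(lo), "+r"(hi)
            : "r"(x), "r"(y));
#else
    /* Portable fallback (assumption: used only for host-side testing). */
    int32_t a = ((x & 0xFFFF) ^ 0x8000) - 0x8000;    /* sign-extend low 16 */
    int32_t b = ((y & 0xFFFF) ^ 0x8000) - 0x8000;
    uint64_t sum = (((uint64_t)hi << 32) | lo) + (uint64_t)(int64_t)(a * b);
    lo = (uint32_t)sum;
    hi = (uint32_t)(sum >> 32);
#endif
    /* Conversion back to int64_t assumes two's complement, as on GCC. */
    *acc = (int64_t)(((uint64_t)hi << 32) | lo);
}
```

Whether the compiler then keeps the two halves in the original register pair depends on its register allocator and version, so the generated code still needs to be inspected.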

As a second question (it could be related):
For a simple saturating add instruction (returns a+b, saturated):
static __inline int qadd(int x, int y)
{
	int ret__;
	__asm__("qadd %0, %1, %2" : "=r"(ret__) : "r"(x), "r"(y));
	return ret__;
}
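To check a wrapper like this on a host machine, a portable model of the saturating add can help. The sketch below is a reference under stated assumptions, not ARM's definition: qadd_ref is a hypothetical name, and the model ignores the Q (sticky saturation) flag that the real instruction sets.

```c
#include <stdint.h>

/* Hypothetical reference model of a signed 32-bit saturating add. */
static int32_t qadd_ref(int32_t x, int32_t y)
{
    /* Widen to 64 bits so the intermediate sum cannot overflow. */
    int64_t sum = (int64_t)x + (int64_t)y;

    if (sum > INT32_MAX)
        return INT32_MAX;   /* clamp on positive overflow */
    if (sum < INT32_MIN)
        return INT32_MIN;   /* clamp on negative overflow */
    return (int32_t)sum;
}
```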

If I then write a simple function to test whether gcc generates the expected sequence, it generates this kind of code for all the test functions:
00000000 <_qadd>:
   0:	e1011050 	qadd	r1, r0, r1
   4:	e1a00001 	mov	r0, r1
   8:	e12fff1e 	bx	lr
Is gcc so dumb that it never tries to use r0 directly as the output register, or did I do something wrong? Even Microsoft's ARM compiler gets that right.