I've spent hours trying to get this to work, but I can't find a way to do it properly. ARMv5TE has the instruction SMLALBB (http://bit.ly/amvRVv):

    SMLALBB RdLo, RdHi, Rm, Rs

It multiplies the bottom 16 bits of Rm by the bottom 16 bits of Rs and adds the 32-bit result to the 64-bit integer held in the register pair RdLo, RdHi.

I've tried everything I can think of and can't get it working. The closest attempt was:

    static __inline void smlalbb(int * lo, int * hi, int x, int y)
    {
        __asm__ __volatile__("smlalbb %0, %1, %2, %3"
                             : "=&r"(lo), "=&r"(hi)
                             : "r"(x), "r"(y), "0"(lo), "1"(hi));
    }

It seemed to produce correct results, but only for a simple test function; when I chained calls to this smlalbb function, the results were no longer correct. The correct way would probably be to use (*lo) and (*hi) in the operand lists, but in that case it adds too many useless loads and stores (instead of translating directly to a single asm instruction, it generates something like 8-10 instructions).

In armcc (ARM's compiler) the function looks simply like this:

    __asm
    {
        smlalbb *lo, *hi, x, y;
    }

Alongside that plain function, which should really translate directly to a single asm instruction (if the parameters passed are all in registers), I also wanted to write an inline function that takes an int64_t as its first parameter instead of a pair of registers. Under the hood, the int64_t behind an int64_t & or int64_t * is held in a pair of registers, but when I write a function that does that, the compiler generates too much housekeeping junk: it loads the halves of the int64_t variable into other temporary registers and, once the result is obtained, writes the temporaries back into the original registers instead of operating directly on those original registers. (Does what I'm trying to say make sense?)
As an example, here's the function that gets generated:

    000000c0 <smlalbb>:
      c0:   e92d4030    push    {r4, r5, lr}
      c4:   e591c000    ldr     ip, [r1]
      c8:   e590e000    ldr     lr, [r0]
      cc:   e1a0500c    mov     r5, ip
      d0:   e1a0400e    mov     r4, lr
      d4:   e1454382    smlalbb r4, r5, r2, r3
      d8:   e5804000    str     r4, [r0]
      dc:   e5815000    str     r5, [r1]
      e0:   e8bd8030    pop     {r4, r5, pc}

Any idea whether this can be done at all with inline asm?

For the int64_t version, here's the best result I got:

    000000d4 <smlalXX64>:
      d4:   e52de004    push    {lr}            ; (str lr, [sp, #-4]!)
      d8:   e24dd008    sub     sp, sp, #8
      dc:   e88d0003    stm     sp, {r0, r1}
      e0:   e89d0003    ldm     sp, {r0, r1}
      e4:   e1a0e000    mov     lr, r0
      e8:   e1a0c001    mov     ip, r1
      ec:   e14ce382    smlalbb lr, ip, r2, r3
      f0:   e58de000    str     lr, [sp]
      f4:   e58dc004    str     ip, [sp, #4]
      f8:   e1a0000e    mov     r0, lr
      fc:   e1a0100c    mov     r1, ip
     100:   e14103c2    smlalbt r0, r1, r2, r3
     104:   e88d0003    stm     sp, {r0, r1}
     108:   e1a0e000    mov     lr, r0
     10c:   e1a0c001    mov     ip, r1
     110:   e14ce3a2    smlaltb lr, ip, r2, r3
     114:   e58de000    str     lr, [sp]
     118:   e58dc004    str     ip, [sp, #4]
     11c:   e14ce3e2    smlaltt lr, ip, r2, r3
     120:   e28dd008    add     sp, sp, #8
     124:   e8bd8000    pop     {pc}

And the C function was:

    void smlalXX64 (int64_t i, int a, int b)
    {
        smlalbb64(&i, a, b);
        smlalbt64(&i, a, b);
        smlaltb64(&i, a, b);
        smlaltt64(&i, a, b);
    }

As a second question (it could be related): for a simple saturated add instruction (returns x+y, saturated):

    static __inline int qadd(int x, int y)
    {
        int ret__;
        __asm__("qadd %0, %1, %2" : "=r"(ret__) : "r"(x), "r"(y));
        return ret__;
    }

If I then write a simple function to test whether gcc generates the expected sequence, it generates this kind of code for all test functions:

    00000000 <_qadd>:
       0:   e1011050    qadd    r1, r0, r1
       4:   e1a00001    mov     r0, r1
       8:   e12fff1e    bx      lr

Is gcc so dumb that it never tries to use r0 directly as the output register, or did I do something wrong? Even Microsoft's ARM compiler gets that right.
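To make the intended SMLALBB behaviour concrete, here is a minimal sketch of the by-value int64_t wrapper. It is an illustration under stated assumptions, not a confirmed solution. The names `sext16`, `smlalbb_model` and `smlalbb64_value` are hypothetical and not part of the original code. The ARM branch assumes GCC-style extended asm and that the toolchain defines the ACLE macro `__ARM_FEATURE_DSP` for ARMv5TE-class targets. It passes the two halves as read-write (`"+r"`) operands on local values instead of going through pointers. The other branch is a plain-C model of the instruction as described above, so the snippet also builds on non-ARM hosts.

```c
#include <stdint.h>

/* Sign-extend the bottom 16 bits of v to 32 bits without relying on
   implementation-defined narrowing conversions. */
static inline int32_t sext16(int32_t v)
{
    int32_t low = (int32_t)((uint32_t)v & 0xFFFFu);
    return (low ^ 0x8000) - 0x8000;
}

/* Plain-C model of SMLALBB: acc += (bottom16 of x) * (bottom16 of y).
   The addition is done in unsigned arithmetic so it wraps like the
   instruction does instead of overflowing a signed type. */
static inline int64_t smlalbb_model(int64_t acc, int32_t x, int32_t y)
{
    int64_t prod = (int64_t)(sext16(x) * sext16(y));
    return (int64_t)((uint64_t)acc + (uint64_t)prod);
}

/* Hypothetical wrapper taking the accumulator by value. */
static inline int64_t smlalbb64_value(int64_t acc, int32_t x, int32_t y)
{
#if defined(__GNUC__) && defined(__arm__) && defined(__ARM_FEATURE_DSP)
    /* Assumption: target supports the ARMv5TE DSP multiplies. */
    uint32_t lo = (uint32_t)acc;
    uint32_t hi = (uint32_t)((uint64_t)acc >> 32);
    /* "+r" marks lo and hi as both read and written by the instruction. */
    __asm__("smlalbb %0, %1, %2, %3" : "+r"(lo), "+r"(hi) : "r"(x), "r"(y));
    return (int64_t)(((uint64_t)hi << 32) | lo);
#else
    return smlalbb_model(acc, x, y);
#endif
}
```

A chained use would then read `acc = smlalbb64_value(acc, a, b);`. Whether the compiler keeps the two halves in their original registers without extra moves depends on the GCC version and optimization level, so the generated code still needs to be checked.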