On 06/16/2010 11:23 AM, Pavel Pavlov wrote:
> I spent hours trying to get this working, but I can't find a way to do it properly.
> In ARMv5TE there is an instruction SMLALBB http://bit.ly/amvRVv
>     SMLALBB RdLo, RdHi, Rm, Rs
> It multiplies the bottom 16 bits of Rm by the bottom 16 bits of Rs and adds the 32-bit result to the 64-bit integer held in the register pair RdLo, RdHi.
> I tried everything I could think of, and I can't get it working.
>
> The closest try was:
> static __inline void smlalbb(int * lo, int * hi, int x, int y)
> {
>     __asm__ __volatile__("smlalbb %0, %1, %2, %3" : "=&r"(lo), "=&r"(hi) : "r"(x), "r"(y), "0"(lo), "1"(hi));
> }
>
> It seems to produce the correct result, but that only worked for a simple test function; when I chained calls to this smlalbb function, the results weren't correct anymore.
>
> The correct way would probably be to use (*lo) and (*hi) in the operand lists, but in that case it adds too many useless loads and stores (instead of translating directly to a single asm instruction, it generates something like 8-10 instructions).

I think it should be

#include <stdint.h>

static inline uint64_t smlalbb(uint64_t acc, unsigned int lo, unsigned int hi)
{
    union
    {
        uint64_t ll;
        /* The two halves must sit in a struct; as direct union members
           they would both alias the same 32 bits.  l is the low word
           on a little-endian target. */
        struct
        {
            unsigned int l;
            unsigned int h;
        };
    } retval;

    retval.ll = acc;

    /* "+r" marks each half as both read and written by the instruction. */
    __asm__("smlalbb %0, %1, %2, %3"
            : "+r"(retval.l), "+r"(retval.h)
            : "r"(lo), "r"(hi));

    return retval.ll;
}

/* smlalbt, smlaltb and smlaltt are assumed to be defined the same way as smlalbb. */
uint64_t smlalXX64 (uint64_t i, unsigned int a, unsigned int b)
{
    uint64_t tmp = i;

    tmp = smlalbb(tmp, a, b);
    tmp = smlalbt(tmp, a, b);
    tmp = smlaltb(tmp, a, b);
    tmp = smlaltt(tmp, a, b);

    return tmp;
}

Andrew.
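To check a wrapper like this against what the instruction is supposed to compute, a plain C model of the semantics can help. The following is only a sketch under a few assumptions: the helper name smlal_ref and its x_top/y_top flags are made up for illustration, the inline asm path is guarded by the ACLE macro __ARM_FEATURE_DSP (older compilers may not define it, in which case the portable path is used), and the accumulator is split with shifts instead of a union so that it does not depend on endianness.

```c
#include <stdint.h>

/* Hypothetical reference model of SMLAL<x><y>: pick one 16-bit half of each
   operand, treat it as signed, multiply, and add the product to the 64-bit
   accumulator.  Unsigned arithmetic is used for the final add so that it
   wraps modulo 2^64, as the register pair does. */
static inline uint64_t smlal_ref(uint64_t acc, uint32_t x, uint32_t y,
                                 int x_top, int y_top)
{
    /* Narrowing to int16_t relies on the usual two's-complement
       conversion that GCC and Clang implement. */
    int16_t xh = (int16_t)(uint16_t)(x_top ? (x >> 16) : x);
    int16_t yh = (int16_t)(uint16_t)(y_top ? (y >> 16) : y);
    int32_t prod = (int32_t)xh * (int32_t)yh;   /* at most 2^30 in magnitude */

    return acc + (uint64_t)(int64_t)prod;       /* sign-extend, then add */
}

/* SMLALBB: bottom half of x times bottom half of y, accumulated into acc. */
static inline uint64_t smlalbb(uint64_t acc, uint32_t x, uint32_t y)
{
#if defined(__arm__) && defined(__ARM_FEATURE_DSP)
    uint32_t lo = (uint32_t)acc;           /* RdLo */
    uint32_t hi = (uint32_t)(acc >> 32);   /* RdHi */

    /* Both halves are read and written, hence "+r". */
    __asm__("smlalbb %0, %1, %2, %3"
            : "+r"(lo), "+r"(hi)
            : "r"(x), "r"(y));

    return ((uint64_t)hi << 32) | lo;
#else
    /* Portable fallback for hosts without the instruction. */
    return smlal_ref(acc, x, y, 0, 0);
#endif
}
```

Passing the accumulator by value and returning it lets the compiler keep both halves in registers across chained calls, which should avoid the extra loads and stores that the pointer-based version provokes. The other three variants would differ only in the mnemonic and in the x_top/y_top flags passed to the fallback.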