Re: How to inline a huge m68k code?

Andrew Haley <aph@xxxxxxxxxx> · Mon, 13 Jul 2009 12:45:22 +0100

ami_stuff wrote:

>> I don't understand what your actual question is.  What do you want gcc
>> to do differently?
>>
>> gcc's longlong.h already has inlined assembly code for 32x32->64
>> multiplication.  For the 68060 it looks like this:
>>
>> #define umul_ppmm(xh, xl, a, b) \
>>   __asm__ ("| Inlined umul_ppmm\n"					\
>> 	   "	move%.l	%2,%/d0\n"					\
>> 	   "	move%.l	%3,%/d1\n"					\
>> 	   "	move%.l	%/d0,%/d2\n"					\
>> 	   "	swap	%/d0\n"						\
>> 	   "	move%.l	%/d1,%/d3\n"					\
>> 	   "	swap	%/d1\n"						\
>> 	   "	move%.w	%/d2,%/d4\n"					\
>> 	   "	mulu	%/d3,%/d4\n"					\
>> 	   "	mulu	%/d1,%/d2\n"					\
>> 	   "	mulu	%/d0,%/d3\n"					\
>> 	   "	mulu	%/d0,%/d1\n"					\
>> 	   "	move%.l	%/d4,%/d0\n"					\
>> 	   "	eor%.w	%/d0,%/d0\n"					\
>> 	   "	swap	%/d0\n"						\
>> 	   "	add%.l	%/d0,%/d2\n"					\
>> 	   "	add%.l	%/d3,%/d2\n"					\
>> 	   "	jcc	1f\n"						\
>> 	   "	add%.l	%#65536,%/d1\n"					\
>> 	   "1:	swap	%/d2\n"						\
>> 	   "	moveq	%#0,%/d0\n"					\
>> 	   "	move%.w	%/d2,%/d0\n"					\
>> 	   "	move%.w	%/d4,%/d2\n"					\
>> 	   "	move%.l	%/d2,%1\n"					\
>> 	   "	add%.l	%/d1,%/d0\n"					\
>> 	   "	move%.l	%/d0,%0"					\
>> 	   : "=g" ((USItype) (xh)),					\
>> 	     "=g" ((USItype) (xl))					\
>> 	   : "g" ((USItype) (a)),					\
>> 	     "g" ((USItype) (b))					\
>> 	   : "d0", "d1", "d2", "d3", "d4")
>>
> 
> But this code is slow compared to code I posted.

It looks much the same to me, four MULUs and a bunch of carry propagation.
What would you like to change?
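
For reference, here is roughly what the longlong.h sequence computes, written as a C sketch: four 16x16->32 partial products (the four MULUs), then the carry propagation. The name umul_ppmm_c and the pointer-based signature are made up for illustration and are not part of longlong.h.

```c
#include <stdint.h>

/* Hypothetical C rendering of the asm above, not code from longlong.h.
   Computes the 64-bit product a * b as two 32-bit halves. */
void umul_ppmm_c(uint32_t *xh, uint32_t *xl, uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xffff, ah = a >> 16;
    uint32_t bl = b & 0xffff, bh = b >> 16;

    /* The four MULUs: each is a 16x16->32 multiply. */
    uint32_t ll = al * bl;
    uint32_t lh = al * bh;
    uint32_t hl = ah * bl;
    uint32_t hh = ah * bh;

    /* Middle column: lh + (ll >> 16) cannot overflow 32 bits,
       but adding hl can, which is what the jcc in the asm tests. */
    uint32_t mid = lh + (ll >> 16);
    mid += hl;
    if (mid < hl)
        hh += 0x10000;          /* carry into bit 16 of the high word */

    *xh = hh + (mid >> 16);
    *xl = (mid << 16) | (ll & 0xffff);
}
```

Any faster variant still has to produce these same two words, so the saving would have to come from instruction scheduling or register use rather than from fewer multiplies.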

> FFmpeg linked with object generated from asm code I posted is 5% faster
> (mp3 -> wav) on the real 68060 compared to default GCC asm code from
> longlong.h.
> 
> This audio decoder uses MULH:
> 
> http://gnunet.org/libextractor/doxygen/html/mpegaudiodec_8c-source.html
> 
> I want:
> 
> 1. use the code which I posted as an inline, but I don't know how to inline
> it correctly

There is a complete worked example right here, the umul_ppmm inline that Ian
posted, which takes almost exactly the same arguments.  Can't you use it as a
model?

> 2. if the code is already faster with FFmpeg as an externally linked object,
> it will be even faster when inlined, so maybe the default GCC code should be
> replaced with the code I posted?

Sure, but first we need to know why the code you posted is faster.  I can't
immediately see why it should be.

Maybe it's something to do with sign extension.  It's possible that gcc doesn't
use the umul_ppmm inline that Ian posted because you sign extend both args
before multiplying.

This can easily be worked around, but we need to see the code gcc generates.
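
To illustrate the sign extension point, here is a sketch of one possible workaround. It assumes MULH is the usual "sign extend both arguments to 64 bits, multiply, keep the high word" operation; I have not checked that against the decoder source. The function names are hypothetical. The signed high word can be derived from the unsigned one with two conditional subtractions, so an unsigned 32x32->64 multiply would still be usable.

```c
#include <stdint.h>

/* Assumed shape of MULH: signed 64-bit product, high 32 bits.
   Relies on arithmetic right shift of negative values, as gcc provides. */
int32_t mulh_ref(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* Hypothetical workaround: start from the unsigned high word
   (what umul_ppmm would return in xh) and correct for the signs. */
int32_t mulh_from_unsigned(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t hi = (uint32_t)(((uint64_t)ua * ub) >> 32);

    /* A negative a was read as a + 2^32, which added ub to the high word;
       likewise for a negative b.  Undo both, modulo 2^32. */
    if (a < 0)
        hi -= ub;
    if (b < 0)
        hi -= ua;

    /* Conversion back to int32_t wraps on gcc. */
    return (int32_t)hi;
}
```

Whether this beats a dedicated signed sequence on the 68060 is something only the generated code and a measurement can show.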

Andrew.