Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize

Hi George,

On 12/05/16 18:04, George Spelvin wrote:
Thanks for the response!

That mulsi3.S code was taken directly from gcc's lb1sf68.S (and that
was lb1sf68.asm in older versions of gcc). It has always had that
coldfire conditional code in it. Not sure why it was in there though.

I went digging through a lot of ColdFire documentation to see if
there were ever any that needed it, and apparently it's been in there
since the beginning.

It did occur to me that perhaps someone wanted to support the
intersection of all 68k family CPUs, so that meant only MULU.W
and ADD.L.  But I can't see a gcc option to ask for that.


Mostly I just thought that saving a few cycles was a nice win.

Yep, that is always nice.

After a few test builds I am pretty sure that we never need or
use mulsi3 on ColdFire. So I think we can move to not even bothering
to compile it for that case.


This is tangential to some work I'm doing on the hash functions
in <linux/hash.h>: there are hash_32(x, bits) and hash_64(x, bits)
which basically do "return (x * CONSTANT) >> (32 - bits)", or
64 - bits for the 64-bit version.
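
For concreteness, the 32-bit pattern is roughly this (just a sketch of
the shape, not the exact <linux/hash.h> code; the constant is the
current 32-bit one discussed further down):

/* hash_64() is the same shape, with a 64-bit constant and "64 - bits".
 * Assumes 32-bit unsigned, like the rest of this mail.
 */
static inline unsigned hash_32_sketch(unsigned val, unsigned bits)
{
	return (val * 0x9e370001u) >> (32 - bits);
}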

The CONSTANTs were chosen to be efficient when computed using
shift-and-add sequences, but that made them less effective hashes.

The 64-bit constant was recently revised to avoid a bad case found when
hashing page-aligned pointers; keeping it shift-and-add-friendly was a
pointless optimization anyway, since all 64-bit processors have fast
multipliers.  But the 32-bit constant was left alone for fear of hurting
performance.

After some research (partly by searching for __mulsi3), I found that m68k
is the only platform that doesn't have a fast multiplier.  SPARC used
to be in that boat too, but SPARCv7 support was dropped a few years
ago.

That led to me optimizing the multiply-by-constant code for m68k.



Do you have any interest in the matter?  Basically, the options are
to optimize the implementation of the multiplication, or come up with
a different hash function that works better in that environment.

It looks like only the original m68000 targets will benefit from
a faster mulsi3. But there is nothing wrong with making it faster
for that case in general.

I don't have any feel for how much of a performance problem the
hash functions are on m68000 though. Keep in mind that those are
all non-MMU platforms.

Regards
Greg


The current multiply by 0x9e370001 compiles to, depending on compiler
flags, either a call to __mulsi3 or (omitting prologue & epilogue,
since it's designed as an in-line function):

	move.l	%d0,%d1		 4
	lsl.l	#6,%d1		20  24
	move.l	%d1,%d2		 4  28
	lsl.l	#3,%d2		14  42
	sub.l	%d1,%d2		 8  50
	sub.l	%d0,%d2		 8  58
	move.l	%d2,%d1		 4  62
	lsl.l	#3,%d1		14  76
	sub.l	%d2,%d1		 8  84
	lsl.l	#3,%d1		14  98
	add.l	%d0,%d1		 8 106
	swap	%d1		 4 110
	clr.w	%d1		 4 114
	sub.l	%d1,%d0		 8 122

(The columns on the right are cycle counts.)
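
In C terms, that sequence builds x * 0x61C9 with shifts and adds and
then subtracts the shifted result from x, i.e. roughly (mod 2^32):

static unsigned mul_9e370001_shifts(unsigned x)
{
	unsigned t = x * 0x61C9u;	/* built up by the shifts and adds */

	/* x - 0x61C90000*x = x * (1 - 0x61C90000) = x * 0x9e370001 */
	return x - (t << 16);		/* the final swap/clr.w/sub.l */
}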

Tangent: I'm surprised I can't coax it into the equivalent:
	move.w %d0,%d1		 4
	mulu.w #0x9e37,%d1	58  62
	swap %d1		 4  66
	clr.w %d1		 4  70
	add.l %d1,%d0		 8  78

Anyway, the problem is the low Hamming weight of the low 16 bits,
which has to be solved somehow.  The "more random" high 16 bits of the
multiplier are only multiplied by the low 16 bits of the argument.
The high 16 bits of the argument are just multiplied by 1.  If the
argument is a page-aligned pointer, with 12 low-order bits zero, we have
very poor mixing.
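
Spelled out: since klow is just 0x0001, the whole multiply mod 2^32
collapses to something like

/* Only khigh = 0x9e37 does any mixing, and it only ever sees the
 * low 16 bits of x; the high 16 bits of x pass through unmultiplied.
 */
static unsigned mul_9e370001(unsigned x)
{
	return x + (((x & 0xffff) * 0x9e37u) << 16);
}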

Using the better constant 0x61C88647, gcc struggles.  (Both this and the
previous constant are based on (sqrt(5)-1)/2, the fractional part of the
golden ratio, scaled to 32 bits; this one is the negative of that value.)

I searched for an efficient shift-and-add sequence,
and came up with the 68000-optimized:

/* Multiply by 0x61C88647 */
unsigned hash_32(unsigned x)
{
	unsigned a, b, c;

	a = b = c = x;	
	c <<= 19;
	a += c;		/* a = (x << 19) + x */
	b <<= 9;
	b += a;		/* b = (x << 9) + a */
	c <<= 4;
	c += b;		/* c = (x << 23) + b */
	b <<= 5;
	b += c;		/* (b << 5) + c */
	b <<= 3;
	b += a;		/* ((b << 5) + c << 3) + a */
	b <<= 3;
	b -= c;		/* (((b << 5) + c << 3) + a << 3) - c */
	return b;
}

which gcc turns into:
	move.l	%d1,%d2		 4
	lsl.w	#3,%d2		12  16
	swap	%d2		 4  20
	clr.w	%d2		 4  24
	move.l	%d1,%a1		 4  28
	add.l	%d2,%a1		 8  36
	moveq	#9,%d0		 4  40
	lsl.l	%d0,%d1		26  66
	add.l	%a1,%d1		 8  74
	lsl.l	#4,%d2		16  90
	move.l	%d1,%a0		 4  94
	add.l	%d2,%a0		 8 102
	lsl.l	#5,%d1		18 120
	move.l	%a0,%d0		 4 124
	add.l	%d1,%d0		 8 132
	move.l	%d0,%d1		 4 136
	lsl.l	#3,%d1		14 150
	move.l	%a1,%d0		 4 154
	add.l	%d1,%d0		 8 162
	lsl.l	#3,%d0		14 176
	sub.l	%a0,%d0		 8 184
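
As a quick host-side sanity check that the shift-and-add sequence above
really is a multiply by 0x61C88647 (a throwaway test, not part of any
patch):

#include <assert.h>

int main(void)
{
	unsigned x;

	for (x = 0; x < 0xffff0000u; x += 0x10001u) {
		unsigned a = x, b = x, c = x;

		c <<= 19; a += c;
		b <<= 9;  b += a;
		c <<= 4;  c += b;
		b <<= 5;  b += c;
		b <<= 3;  b += a;
		b <<= 3;  b -= c;
		assert(b == x * 0x61C88647u);
	}
	return 0;
}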

I can shave a few cycles with hand-optimized code, but the result is
barely faster than the much smaller multiply-based:

	; y = #k * x, with z as a temp
	move.w  #klow, y        ;  8
	move.w  x, z            ;  4     12
	swap    x               ;  4     16
	mulu.w  y, x            ; 52     68	(Using klow = 0x8647)
	mulu.w  z, y            ; 54    122	(Assuming 50% ones)
	mulu.w  #khigh, z       ; 54    176	(Using khigh = 0x61C8)
	swap    y               ;  4    180
	add.w   x, y            ;  4    184
	add.w   z, y            ;  4    188
	swap    y               ;  4    192
	;; 12 words
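
In C, that three-multiply split is just the usual decomposition of a
32x32 product into 16x16 pieces, mod 2^32 (a sketch; the xhigh * khigh
term would shift out entirely, so it is never computed):

static unsigned mul_61C88647_3x16(unsigned x)
{
	unsigned xlo = x & 0xffff, xhi = x >> 16;
	unsigned lo  = xlo * 0x8647u;			/* xlow * klow */
	unsigned mid = xlo * 0x61C8u + xhi * 0x8647u;	/* cross terms */

	return lo + (mid << 16);
}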

Another way to divide it up is to compute xlow * khigh with a multiply,
and x * klow with shifts and adds.  x * 0x8647 can be evaluated as:

	unsigned a = (x << 9) + x;
	unsigned b = (x << 1) + a;
	unsigned c = (b << 1) + (a << 6) + a;
	return c + ((uint16_t)(x * 0x61C8) << 16);

If I write it in the following form:
	register unsigned a, b;

	b = x << 1;
	asm("" : "=g" (a) : "0" (b));	/* "a = b, DAMN IT" */
	a <<= 8;
	a += x;
	b += a;
	b += b;
	b += a;
	a <<= 6;
	return a + b + ((uint16_t)(x * 0x61c8) << 16);

then gcc can be coaxed into generating the desired code.
(The "asm" is the only way I found to stop GCC from collapsing
"(x << 1) << 8" to "x << 9", which is slower.)

baz:
	move.l 4(%sp),%d1
	move.l %d1,%a0		 4
	add.l %d1,%a0		 8  12
	move.l %a0,%d0		 4  16
	lsl.l #8,%d0		24  40
	add.l %d1,%d0		 8  48
	add.l %d0,%a0		 8  56
	muls.w #25032,%d1	56 112
	swap %d1		 4 116
	clr.w %d1		 4 120
	add.l %d0,%d1		 8 128
	add.l %a0,%d1		 8 136
	add.l %a0,%d1		 8 144
	lsl.l #6,%d0		20 164
	add.l %d1,%d0		 8 172
	rts

Well, that's a *little* bit faster.
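
For reference, the same fragment folded into a self-contained function
(the name is arbitrary; it assumes gcc's extended asm and 32-bit
unsigned, as above):

static unsigned mul_61C88647(unsigned x)
{
	unsigned a, b = x << 1;

	/* Opaque copy: stops gcc folding (x << 1) << 8 into x << 9. */
	asm("" : "=g" (a) : "0" (b));
	a <<= 8;	/* a = x * 0x200 */
	a += x;		/* a = x * 0x201 */
	b += a;		/* b = x * 0x203 */
	b += b;		/* b = x * 0x406 */
	b += a;		/* b = x * 0x607 */
	a <<= 6;	/* a = x * 0x8040; a + b = x * 0x8647 */
	return a + b + (((x * 0x61C8u) & 0xffffu) << 16);
}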



Alternatively, I could just invent a whole different hash function for
the case.  One idea is to use two, rather than three, 16-bit multiplies.
Split the input into two 16-bit halves, multiply them by separate
constants, sum the results, and then swap the halves.

	move.w	x, y		 4
	swap	x		 4   8
	mulu.w	#K1, y		58  66
	mulu.w	#K2, x		58 124
	add.l	y, x		 8 132
	swap	x		 4 136

The swap is important; hash_32() is documented as putting the "most mixed"
bits in the most significant bits, as hash tables are indexed by the most
significant bits.
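
In C the idea is something like this sketch (K1 and K2 become
parameters here, since I haven't settled on constants):

/* Two 16x16 multiplies and a swap; k1 and k2 are 16-bit constants
 * still to be chosen.
 */
static unsigned hash_2x16(unsigned x, unsigned k1, unsigned k2)
{
	unsigned sum = (x & 0xffff) * k1 + (x >> 16) * k2;

	return (sum << 16) | (sum >> 16);	/* the final swap */
}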

If you have any thoughts on the matter, they'd definitely be appreciated.




