From: Holger Lubitz
Sent: 24 December 2022 09:34
On Thu, 2022-12-22 at 10:41 +0000, David Laight wrote:
I wonder how much slower it is - m68k is likely to be microcoded
and I don't think instruction timings are actually available.
Not sure if these are in any way official, but
http://oldwww.nvg.ntnu.no/amiga/MC680x0_Sections/mc68030timing.HTML
I thought about that some more and remember seeing memory timings
on a logic analyser - and getting timings that (more or less)
implied sequential execution limited by the obvious memory (cache)
accesses.
The microcoding is more apparent in the large mid-instruction
interrupt stack frames, e.g. for page faults.
(There's also
http://oldwww.nvg.ntnu.no/amiga/MC680x0_Sections/mc68000timing.HTML
but that is probably only interesting to demo coders by now)
The fastest version probably uses subx (with carry) to generate
0/-1 and leaves +delta for the other result - but getting the
compares and branches in the right order is hard.
Guess it must have been over 20 years since I wrote any 68k asm, but
now I actually ended up installing Debian on qemu to experiment.
There are two interesting differences between 68k and x86 that can be
useful here: Unlike x86, MOV on 68k sets the flags. And also, subx
differs from sbb in that it resets the zero flag on a non-zero result,
but does not set it on a zero result. So if it is set, it must have
been set before.
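To make the Z-flag difference concrete, here is a minimal C sketch that models only the borrow and zero flags. The names `struct flags`, `model_sbb` and `model_subx` are made up for illustration, and the single `x` field stands in for both the x86 carry flag and the 68k X flag; this is a simplified model, not a full description of either instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: only the borrow and zero flags are tracked. */
struct flags {
    bool x; /* borrow in/out (CF on x86, X on 68k) */
    bool z; /* zero flag */
};

/* x86-style sbb: Z reflects this result alone. */
uint32_t model_sbb(uint32_t dst, uint32_t src, struct flags *f)
{
    uint64_t wide = (uint64_t)dst - src - (f->x ? 1 : 0);
    uint32_t res = (uint32_t)wide;

    f->x = (wide >> 32) != 0;   /* upper bits set means a borrow occurred */
    f->z = (res == 0);
    return res;
}

/* 68k-style subx: Z is cleared on a non-zero result, otherwise unchanged. */
uint32_t model_subx(uint32_t dst, uint32_t src, struct flags *f)
{
    uint64_t wide = (uint64_t)dst - src - (f->x ? 1 : 0);
    uint32_t res = (uint32_t)wide;

    f->x = (wide >> 32) != 0;
    if (res != 0)
        f->z = false;           /* a zero result leaves Z as it was */
    return res;
}
```

In this model, a zero result from `model_subx` leaves a previously clear Z clear, which is the property the paragraph above relies on.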
Here are the two functions I came up with (tested only stand-alone, not
in a kernel build; no benchmarks either, because this 68040 is only
emulated).
#1 (optimized for minimum instruction count in loop,
68k + Coldfire ISA_B)
int strcmp1(const char *cs, const char *ct)
{
int res;
asm ("\n"
"1: move.b (%0)+,%2\n" /* get *cs */
" jeq 2f\n" /* end of first string? */
" cmp.b (%1)+,%2\n" /* compare *ct */
" jeq 1b\n" /* if equal, continue */
" jra 3f\n" /* else skip to tail */
"2: cmp.b (%1)+,%2\n" /* compare one last byte */
"3: subx.l %2, %2\n" /* -1 if borrow, 0 if not */
" jls 4f\n" /* if set, z is from sub.b */
The subx will set Z unless C was set.
So that doesn't seem right.
" moveq.l #1, %2\n" /* 1 if !borrow */
"4:"
: "+a" (cs), "+a" (ct), "=d" (res));
return res;
}
I think this should work:
(But the jc might need to be jnc.)
" moveq.l #0,%2\n" /* zero high bits of result */
"1: move.b (%1)+,%2\n" /* get *ct */
" jeq 2f\n" /* end of second string? */
" cmp.b (%0)+,%2\n" /* compare *cs */
" jeq 1b\n" /* if equal, continue */
" jc 4f\n" /* return +ve */
" moveq.l #-1, %2\n" /* return -ve */
" jra 4f\n"
"2: move.b (%0),%2\n" /* check for matching strings */
"4:"
#2 (optimized for minimum code size,
Coldfire ISA_A compatible)
int strcmp2(const char *cs, const char *ct)
{
int res = 0, tmp = 0;
asm ("\n"
"1: move.b (%0)+,%2\n" /* get *cs */
" move.b (%1)+,%3\n" /* get *ct */
" subx.l %3,%2\n" /* compare a byte */
" jeq 2f\n" /* both inputs were zero */
That doesn't seem right.
Z will be set if either *ct is zero or the bytes match.
" tst.l %2\n" /* check result */
This only sets Z when it was already set by the subx.
" jeq 1b\n" /* if zero, continue */
"2:"
: "+a" (cs), "+a" (ct), "+d" (res), "+d" (tmp));
return res;
}
However, this one needs res and tmp to be set to zero, because we read
only bytes (no automatic zero-extend on 68k), but then do a long
operation on them. Coldfire ISA_A dropped cmpb, it only reappeared in
ISA_B.
So the real instruction count is likely to be two more, unless gcc
happens to have one or two zeros it can reuse.
I believe some of the other m68k asm functions are also missing
the "memory" 'clobber' and so could get mis-optimised.
In which case would that happen? This function doesn't clobber memory
and its result does get used. If gcc mistakenly thinks the parameters
haven't changed and uses a previously cached result, wouldn't that
apply to a C function too?
You need a memory 'clobber' on anything that READS memory as well
as writes it.
While I can write (or rather have written) m68k asm I don't have
a compiler.
Well, I now have an emulated Quadra 800 running Debian 68k. (Getting the
emulated networking to work reliably was a bit problematic, though. But
now it runs kernel 6.0.) qemu could emulate Coldfire too, but I am not
sure where I would find a distribution for that.
I did not attach a patch because it seems already to be decided that
the function is gone. But should anyone still want to include one (or
both) of these functions, just give credit to me and I'm fine.
Thinking further, the fastest strcmp() probably uses big-endian word compares
with a check for a zero byte,
especially on 64-bit systems that support misaligned loads.
But I'd need to think hard about the actual details.
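A rough C sketch of that idea, not a tuned implementation: it uses 32-bit words and assumes both strings sit in buffers zero-padded to a multiple of 4 bytes, so a whole-word read never runs past the buffer. The names `has_zero_byte`, `load_be32` and `strcmp_words` are hypothetical. The byte-shift load is only there to keep the sketch endian-neutral; on a big-endian CPU it would be a plain load.

```c
#include <stdint.h>

/* Non-zero if any byte of v is zero (well-known bit trick). */
uint32_t has_zero_byte(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Assemble 4 bytes as a big-endian value, whatever the host byte order. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * ASSUMPTION: cs and ct point into buffers padded to a multiple of 4 bytes,
 * so reading the whole word that contains the terminator is safe.
 */
int strcmp_words(const char *cs, const char *ct)
{
    const unsigned char *a = (const unsigned char *)cs;
    const unsigned char *b = (const unsigned char *)ct;

    for (;;) {
        uint32_t wa = load_be32(a);
        uint32_t wb = load_be32(b);

        if (has_zero_byte(wa))
            break;                      /* terminator in this word */
        if (wa != wb)                   /* big-endian order == string order */
            return wa < wb ? -1 : 1;
        a += 4;
        b += 4;
    }

    /* Finish the last word byte by byte; bytes after the NUL are ignored. */
    while (*a && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}
```

The byte-wise tail keeps the sketch simple; a real version would also have to deal with alignment and page boundaries for unpadded strings.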
David