On Thu, 13 Aug 2009, George Spelvin wrote:

> > Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?
>
> It's probably faster than the ARM, which was tuned for size rather
> than speed, but if you want to rework the assembly for speed, the ARM's
> rotate-and-add operations allow tricks which I doubt GCC can pick up on.
> (You have to notice that the F(b,c,d) function is bitwise, so you can
> do it on rotated data and do the rotate when you add the result to e.)

gcc is not too bad at merging ALU and shift/rotate operations into the
same instruction.  However, to produce a really optimal ARM version, some
custom SHA_ROUND macros with inline assembly could be written.  I suspect
that wouldn't gain much, though, as there really aren't that many pure
shift/rotate mov instructions left in the generated code.

> I'd be surprised if it were faster than the PPC code, especially on the
> in-order G3 and G4 cores where careful scheduling really pays off.
> But maybe I just get to be surprised...

Given that PPC has enough registers to hold the entire state, it is then
only a matter of proper instruction scheduling, which modern gcc ought to
get right.  If not, then this is a good test case for the gcc people to
fix in the PPC machine pipeline description.

> For automatic assembly tuning, I was thinking of having a .c file that
> has a bunch of #ifdef __PPC__ statements that gets run through $(CC) -E.
> That should be a fairly portable way to ??
> The other question about unaligned access is whether it's beneficial
> to make the fetch loop work like this:
>
> 	char const *in;
> 	uint32_t *out;
> 	unsigned lsb = (unsigned)in & 3;
> 	uint32_t const *p32 = (uint32_t const *)(in - lsb);
> 	uint32_t t = ntohl(*p32);
>
> 	switch (lsb) {
> 	case 0:
> 		*out++ = t;
> 		for (i = 1; i < 16; i++)
> 			*out++ = ntohl(*++p32);
> 		break;
> 	case 1:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 8;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 24;
> 		}
> 		break;
> 	case 2:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 16;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 16;
> 		}
> 		break;
> 	case 3:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 24;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 8;
> 		}
> 		break;
> 	}
>
> On the ARM, at least, ntohl() isn't particularly cheap, so loading 4
> bytes and assembling them turns out to be cheaper.  But it's a thought.

Well, that would have to be tested.  This could only be a gain if you
have a fast ntohl(), though.  The other question is deciding when this
is good enough for a generic version: gaining a 5% speedup on raw SHA1
throughput with ugly code might not be worth the maintenance hassle.
At that point you might as well go back to a purely asm version if you
really want that extra edge.

Nicolas