On Thu, 13 Aug 2009, George Spelvin wrote:

> > Wow.  Is it now faster than the arm/ and ppc/ hand-tweaked assembly?
>
> It's probably faster than the ARM, which was tuned for size rather
> than speed, but if you want to rework the assembly for speed, the ARM's
> rotate-and-add operations allow tricks which I doubt GCC can pick up on.
> (You have to notice that the F(b,c,d) function is bitwise, so you can
> do it on rotated data and do the rotate when you add the result to e.)

gcc is not too bad at merging ALU and shift/rotate operations into the
same instruction.  However, to produce a really optimal ARM version, some
custom SHA_ROUND macros with inline assembly could be written.  I suspect
that wouldn't gain much, though, as there really aren't that many pure
shift/rotate mov instructions left in the generated code.

> I'd be surprised if it were faster than the PPC code, especially on the
> in-order G3 and G4 cores where careful scheduling really pays off.
> But maybe I just get to be surprised...

Given that PPC has enough registers to hold the entire state, it is then
only a matter of proper instruction scheduling, which modern gcc ought to
get right.  If not, then this is a good test case for the gcc people to
fix in the PPC machine pipeline description.

> For automatic assembly tuning, I was thinking of having a .c file that
> has a bunch of #ifdef __PPC__ statements that gets run through $(CC) -E.
> That should be a fairly portable way to ??
> The other question about unaligned access is whether it's beneficial
> to make the fetch loop work like this:
>
> 	char const *in;
> 	uint32_t *out;
> 	unsigned lsb = (unsigned)in & 3;
> 	uint32_t const *p32 = (uint32_t const *)(in - lsb);
> 	uint32_t t = ntohl(*p32);
>
> 	switch (lsb) {
> 	case 0:
> 		*out++ = t;
> 		for (i = 1; i < 16; i++)
> 			*out++ = ntohl(*++p32);
> 		break;
> 	case 1:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 8;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 24;
> 		}
> 		break;
> 	case 2:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 16;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 16;
> 		}
> 		break;
> 	case 3:
> 		for (i = 0; i < 16; i++) {
> 			uint32_t s = t << 24;
> 			t = ntohl(*++p32);
> 			*out++ = s | t >> 8;
> 		}
> 		break;
> 	}
>
> On the ARM, at least, ntohl() isn't particularly cheap, so loading 4
> bytes and assembling them turns out to be cheaper.  But it's a thought.

Well, that would have to be tested.  This could only be a gain if you
have a fast ntohl(), though.  The other question is deciding when this
is good enough for a generic version: gaining a 5% speedup on raw SHA1
throughput with ugly code might not be worth the maintenance hassle.
At that point you might as well go back to a purely asm version if you
really want that extra edge.

Nicolas