> Wow. Is it now faster than the arm/ and ppc/ hand-tweaked assembly?

It's probably faster than the ARM, which was tuned for size rather than
speed, but if you want to rework the assembly for speed, the ARM's
rotate-and-add operations allow tricks which I doubt GCC can pick up on.
(You have to notice that the F(b,c,d) function is bitwise, so you can
compute it on rotated data and do the rotate when you add the result to e.)

I'd be surprised if it were faster than the PPC code, especially on the
in-order G3 and G4 cores where careful scheduling really pays off.
But maybe I just get to be surprised...

For automatic assembly tuning, I was thinking of having a .c file that
has a bunch of #ifdef __PPC__ statements that gets run through $(CC) -E.
That should be a fairly portable way to

The other question about unaligned access is whether it's beneficial to
make the fetch loop work like this:

	char const *in;
	uint32_t *out;
	int i;
	/* Offset of "in" within its aligned 32-bit word */
	unsigned lsb = (uintptr_t)in & 3;
	uint32_t const *p32 = (uint32_t const *)(in - lsb);
	uint32_t t = ntohl(*p32);

	switch (lsb) {
	case 0:	/* Already aligned: plain word loads */
		*out++ = t;
		for (i = 1; i < 16; i++)
			*out++ = ntohl(*++p32);
		break;
	case 1:	/* Each output word straddles two aligned words */
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 8;
			t = ntohl(*++p32);
			*out++ = s | t >> 24;
		}
		break;
	case 2:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 16;
			t = ntohl(*++p32);
			*out++ = s | t >> 16;
		}
		break;
	case 3:
		for (i = 0; i < 16; i++) {
			uint32_t s = t << 24;
			t = ntohl(*++p32);
			*out++ = s | t >> 8;
		}
		break;
	}

On the ARM, at least, ntohl() isn't particularly cheap, so loading 4 bytes
and assembling them turns out to be cheaper. But it's a thought.
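
For comparison, the "load 4 bytes and assemble them" alternative could be sketched roughly as below. This is only an illustration, not code from either assembly implementation; the names load_be32 and fetch_block are made up here, and the sketch assumes one 64-byte input block at arbitrary alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: build one big-endian 32-bit word from four
 * individual byte loads. Works at any alignment and needs no ntohl().
 */
static uint32_t load_be32(const unsigned char *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8  | (uint32_t)p[3];
}

/*
 * Hypothetical helper: fill out[0..15] with the 16 big-endian words
 * of one 64-byte block starting at "in".
 */
static void fetch_block(const unsigned char *in, uint32_t out[16])
{
	int i;

	assert(in != NULL && out != NULL);
	for (i = 0; i < 16; i++)
		out[i] = load_be32(in + 4 * i);
}
```

Unlike the aligned-load version, this never touches bytes outside the 64-byte block, at the cost of four loads per word instead of one.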