On Tue, 11 Aug 2009, Linus Torvalds wrote:

> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> >
> > #define SHA_SRC(t) \
> >     ({ unsigned char *__d = (unsigned char *)&data[t]; \
> >        (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })
> >
> > And this provides the exact same performance as the ntohl() based
> > version (4.980s), except that this now copes with unaligned buffers too.
>
> Is it better to do a (conditional) memcpy up front? Or is the byte-based
> one better just because you always end up doing the shifting anyway due
> to most ARM situations being little-endian?

The vast majority of ARM processors git is likely to run on are indeed
using a little-endian environment.

> I _suspect_ that most large SHA1 calls from git are pre-aligned. The
> big SHA1 calls are for pack-file verification in fsck, which should
> all be aligned. Same goes for index file integrity checking.
>
> The actual object SHA1 calculations are likely not aligned (we do that
> object header thing), and if you can't do the htonl() any better way I
> guess the byte-based thing is the way to go..

Let's see.  The default ntohl() provided by glibc generates this code:

	ldr	r3, [r0, #0]
	mov	r0, r3, lsr #24
	and	r2, r3, #0x00ff0000
	orr	r0, r0, r3, asl #24
	orr	r0, r0, r2, lsr #8
	and	r3, r3, #0x0000ff00
	orr	r0, r0, r3, asl #8

Ignoring the load result delay, which gcc should schedule around anyway,
that makes for 7 cycles.

Using the smarter ARM swab32 implementation from Linux we get:

	ldr	r3, [r0, #0]
	eor	r0, r3, r3, ror #16
	bic	r0, r0, #0x00ff0000
	mov	r0, r0, lsr #8
	eor	r0, r0, r3, ror #8

So we're down to 5 cycles, and the SHA1 test is a bit faster too:
4.570s, down from 4.980s.  However, this still works only with aligned
buffers.

Adding your patch, which uses memcpy() to align the data in the
unaligned case, gives me wildly varying results: sometimes 4.930s,
sometimes 5.560s.  I suspect the icache is getting tight, and the SHA1
code and/or the special unaligned memcpy path get evicted from time to
time.

Using the byte access version we get:

	ldrb	r3, [r0, #3]
	ldrb	r2, [r0, #0]
	ldrb	r1, [r0, #1]
	orr	r3, r3, r2, asl #24
	ldrb	r0, [r0, #2]
	orr	r3, r3, r1, asl #16
	orr	r0, r3, r0, asl #8

Again 7 cycles, like the ntohl() based version, which is consistent
with the fact that they both come in at 4.980s.  This time, however,
any buffer alignment is supported.  And in fact the extra 2 cycles over
the swab32() version should still be less overhead per word than the
unaligned memcpy, which costs around 4 cycles per word.


Nicolas
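
P.S. For anyone who wants to play with this outside of the SHA1 code,
here is the xor/rotate trick from the swab32 assembly above expressed
as portable C.  This is only a sketch of the trick, not the actual
Linux implementation, and the function name is mine:

	#include <stdint.h>

	/* Byte-swap a 32-bit word with the xor/rotate trick from the
	 * ARM swab32 assembly quoted above.  After the first xor, byte
	 * i of t holds b[i] ^ b[i+2]; clearing byte 2 and shifting
	 * right by 8 leaves exactly the correction terms needed by the
	 * final xor against the word rotated right by 8. */
	static inline uint32_t swab32_rot(uint32_t x)
	{
		uint32_t t = x ^ ((x >> 16) | (x << 16));  /* eor ..., ror #16 */
		t &= ~(uint32_t)0x00ff0000;                /* bic */
		t >>= 8;                                   /* lsr #8 */
		return t ^ ((x >> 8) | (x << 24));         /* eor ..., ror #8 */
	}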
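
The byte-access load from the SHA_SRC macro can likewise be written as
a self-contained function (again, the name is illustrative); unlike the
swab32 path, it makes no alignment assumption at all:

	#include <stdint.h>

	/* Read a big-endian 32-bit word one byte at a time, from any
	 * alignment; this is the C that the ldrb/orr sequence above is
	 * compiled from. */
	static inline uint32_t load_be32(const unsigned char *p)
	{
		return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
		       ((uint32_t)p[2] <<  8) | ((uint32_t)p[3] <<  0);
	}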
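
And the "memcpy up front" variant I benchmarked has roughly the
following shape; this is my reconstruction for illustration, not your
actual patch, and the names are made up:

	#include <stdint.h>
	#include <string.h>

	/* Return an aligned view of one 64-byte SHA1 input block: copy
	 * into the caller's scratch buffer only when the source pointer
	 * is misaligned, otherwise use the data in place. */
	static const uint32_t *align_block(const void *data,
					   uint32_t scratch[16])
	{
		if ((uintptr_t)data & 3) {
			memcpy(scratch, data, 64);
			return scratch;
		}
		return data;
	}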