On Mon, 4 Aug 2009, George Spelvin wrote:
> +sha1_block_data_order:
> +    pushl   %ebp
> +    pushl   %ebx
> +    pushl   %esi
> +    pushl   %edi
> +    movl    20(%esp),%edi
> +    movl    24(%esp),%esi
> +    movl    28(%esp),%eax
> +    subl    $64,%esp
> +    shll    $6,%eax
> +    addl    %esi,%eax
> +    movl    %eax,92(%esp)
> +    movl    16(%edi),%ebp
> +    movl    12(%edi),%edx
> +.align 16
> +.L000loop:
> +    movl    (%esi),%ecx
> +    movl    4(%esi),%ebx
> +    bswap   %ecx
> +    movl    8(%esi),%eax
> +    bswap   %ebx
> +    movl    %ecx,(%esp)
...

Hmm. Does it really help to do the bswap as a separate initial phase?

As far as I can tell, you load the result of the bswap just a single time
for each value. So the initial "bswap all 64 bytes" seems pointless.

> +    /* 00_15 0 */
> +    movl    %edx,%edi
> +    movl    (%esp),%esi

Why not do the bswap here instead?

Is it because you're running out of registers for scheduling, and want to
use the stack pointer rather than the original source?

Or does the data dependency end up being so much better that you're better
off doing a separate bswap loop?

Or is it just because the code was written that way?

Intriguing, either way.

        Linus
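
To make the two alternatives concrete, here is a rough C sketch (not the assembly from the patch) of one SHA-1 block with both load strategies. The names sha1_block, load_be32, rol32 and the preswap flag are made up for illustration. With preswap set, all 16 input words are byte-swapped into a scratch array before the rounds start, as the quoted code appears to do; with it clear, each word is swapped in the round that first consumes it. Either way the swapped word still has to be stored, since rounds 16..79 read it back for the message expansion.

```c
#include <stdint.h>

/* Big-endian 32-bit load; on x86 a compiler will typically emit mov + bswap. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Rotate left; only called with 0 < n < 32. */
static uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/*
 * Process one 64-byte block (hypothetical helper, not from the patch).
 * preswap != 0: byte-swap all 16 words in a separate initial pass.
 * preswap == 0: byte-swap each word inside rounds 0..15, at first use.
 */
static void sha1_block(uint32_t h[5], const unsigned char *data, int preswap)
{
    uint32_t W[16];   /* 16-word circular message schedule */
    uint32_t a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
    int t;

    if (preswap)
        for (t = 0; t < 16; t++)
            W[t] = load_be32(data + 4 * t);

    for (t = 0; t < 80; t++) {
        uint32_t f, k, w, tmp;

        if (t < 16)
            w = preswap ? W[t] : load_be32(data + 4 * t);
        else
            w = rol32(W[(t - 3) & 15] ^ W[(t - 8) & 15] ^
                      W[(t - 14) & 15] ^ W[t & 15], 1);
        W[t & 15] = w;   /* kept for the expansion in later rounds */

        if (t < 20) {
            f = (b & c) | (~b & d);          k = 0x5A827999u;
        } else if (t < 40) {
            f = b ^ c ^ d;                   k = 0x6ED9EBA1u;
        } else if (t < 60) {
            f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu;
        } else {
            f = b ^ c ^ d;                   k = 0xCA62C1D6u;
        }

        tmp = rol32(a, 5) + f + e + k + w;
        e = d;
        d = c;
        c = rol32(b, 30);
        b = a;
        a = tmp;
    }

    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}
```

Both settings of preswap compute the same digest; they differ only in where the byte swap sits relative to the round arithmetic. Whether one schedules better on a given CPU is exactly the open question above, and this sketch does not settle it.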