Re: x86 SHA1: Faster than OpenSSL

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 3 Aug 2009 23:40:35 -0700 (PDT)

On Mon, 3 Aug 2009, George Spelvin wrote:
> +sha1_block_data_order:
> +	pushl	%ebp
> +	pushl	%ebx
> +	pushl	%esi
> +	pushl	%edi
> +	movl	20(%esp),%edi
> +	movl	24(%esp),%esi
> +	movl	28(%esp),%eax
> +	subl	$64,%esp
> +	shll	$6,%eax
> +	addl	%esi,%eax
> +	movl	%eax,92(%esp)
> +	movl	16(%edi),%ebp
> +	movl	12(%edi),%edx
> +.align	16
> +.L000loop:
> +	movl	(%esi),%ecx
> +	movl	4(%esi),%ebx
> +	bswap	%ecx
> +	movl	8(%esi),%eax
> +	bswap	%ebx
> +	movl	%ecx,(%esp)

...

Hmm. Does it really help to do the bswap as a separate initial phase?

As far as I can tell, you load the result of the bswap just a single time 
for each value. So the initial "bswap all 64 bytes" seems pointless.

> +	/* 00_15 0 */
> +	movl	%edx,%edi
> +	movl	(%esp),%esi

Why not do the bswap here instead?

Is it because you're running out of registers for scheduling, and want to 
use the stack pointer rather than the original source?

Or does the data dependency end up being so much better that you're better 
off doing a separate bswap loop?

Or is it just because the code was written that way?
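
To make the question concrete, here is roughly what the two variants look
like in C. This is only a sketch: the function names are made up for
illustration, use_word() is a stand-in and not the real SHA-1 round
function, and the actual code is the asm quoted above. Variant A swaps all
16 words into the stack buffer first; variant B swaps each word at the
point of first use, which means the source pointer has to stay live in a
register through rounds 0..15.

```c
#include <stdint.h>

/* Portable big-endian 32-bit load; on x86 this is a load plus a bswap. */
static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical stand-in for a round consuming W[t]. NOT real SHA-1. */
static uint32_t use_word(uint32_t acc, uint32_t w)
{
	return ((acc << 5) | (acc >> 27)) + w;
}

/* Variant A: separate bswap phase, then rounds 0..15 read the buffer. */
static uint32_t swap_up_front(const unsigned char *block, uint32_t W[16])
{
	uint32_t acc = 0;
	int t;

	for (t = 0; t < 16; t++)
		W[t] = load_be32(block + 4 * t);
	for (t = 0; t < 16; t++)
		acc = use_word(acc, W[t]);
	return acc;
}

/*
 * Variant B: bswap at first use. W[t] is still stored, since the
 * later rounds need the earlier words again.
 */
static uint32_t swap_at_use(const unsigned char *block, uint32_t W[16])
{
	uint32_t acc = 0;
	int t;

	for (t = 0; t < 16; t++) {
		uint32_t w = load_be32(block + 4 * t);

		W[t] = w;
		acc = use_word(acc, w);
	}
	return acc;
}
```

Both produce the same W[] and the same result. The only difference is
when the swap happens and which pointer the first 16 rounds load from,
so any win for variant A would have to come from scheduling or register
pressure.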

Intriguing, either way.

		Linus