Re: Linus' sha1 is much faster!

Nicolas Pitre <nico@xxxxxxx> · Mon, 17 Aug 2009 13:06:45 -0400 (EDT)

On Mon, 17 Aug 2009, George Spelvin wrote:

> If it helps anyone resolve license issues, here's a from-FIPS-180-2
> implementation that's placed in the public domain.  That should be
> compatible with any license.
> 
> It uses Linus's and Artur's performance ideas, and some of Linus' macro
> ideas (in the rotate implementation), but tries to be textually different.
> Is there anything recognizable that anyone cares to claim copyright to?

I don't think this trick of making source code textually different from
another work while still closely mimicking its structure entitles
you to any copyright (or non-copyright) claims over that other work.  I
certainly wouldn't bet a dime on this standing up in court.
Otherwise anyone could grab any copyrighted source code, perform a
bunch of search-and-replace ops on it, maybe reorder some code
for good measure, and then claim their own copyright on it.  It is
probably much safer to simply ask the people involved to agree to your
relicensing.  And so far I don't see anyone with a stake in this
fiercely wanting to stick to a particular license.

> It's not quite 100% finished, as I haven't benchmarked it against Linus's
> code yet, but it's functionally correct.
> 
> It's also clean with -W -Wall -Wextra.

Not if you build with the unaligned put_be32(), since its destination
pointer is marked const.
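
To illustrate the point: the code under review is not quoted in this
message, so the following is only a hypothetical sketch of an unaligned
big-endian store, with the name put_be32_unaligned() made up for the
example.  The destination has to be a plain (non-const) pointer, since
the function writes through it; a const-qualified destination is what
gcc would complain about.

```c
#include <stdint.h>

/*
 * Hypothetical unaligned big-endian 32-bit store (not the code from
 * the thread).  The destination is deliberately NOT const-qualified,
 * because the function writes through it one byte at a time.
 */
static inline void put_be32_unaligned(void *p, uint32_t v)
{
    unsigned char *d = p;

    d[0] = (unsigned char)(v >> 24);
    d[1] = (unsigned char)(v >> 16);
    d[2] = (unsigned char)(v >>  8);
    d[3] = (unsigned char)(v >>  0);
}
```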

As to the actual result on ARM... Well, the assembly _looks_ much worse 
than Linus' version.  It uses a stack frame of 152 bytes instead of 64 
bytes.  The resulting binary is also 6868 bytes large compared to 6180 
bytes.  Surprisingly, the performance is not that bad (the reason for 
the underlined "looks" above), though still a bit worse: about 5% slower.  
I was expecting much worse than that.

One likely reason for the bad assembly is that gcc is not smart enough 
to propagate constant address offsets across different pointer types.  
For example, my first version of get_be32() was a macro that did this:

#define SHA_SRC(t) \
  ({ unsigned char *__d = (unsigned char *)&data[t]; \
     (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })

With such a construct, gcc would always allocate a register to hold __d 
and then dereference that with an offset from 0 to 3.  Whereas:

#define SHA_SRC(t) \
   ({   unsigned char *__d = (unsigned char *)data; \
        (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
        (__d[(t)*4 + 2] <<  8) | (__d[(t)*4 + 3] <<  0); })

does produce optimal assembly as only the register holding the data 
pointer is dereferenced with the absolute byte offset.  I suspect your 
usage of inline functions has the same effect as the first SHA_SRC 
definition above.
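
As a rough sketch, the two addressing styles can also be written as
plain inline functions.  The function names and the assumption that
data points to 32-bit words are made up for this example and are not
taken from either implementation.  The casts to uint32_t are an
addition as well: without them a byte of 0x80 or above is shifted left
by 24 as a signed int, which overflows.  Both forms return the same
value; any difference is only in the code the compiler generates,
which has to be checked in the assembly output for a given compiler
and target.

```c
#include <stdint.h>

/*
 * Form 1: take the address of word t first, then index bytes 0..3.
 * This mirrors the first SHA_SRC definition.
 */
static inline uint32_t load_be32_via_word_ptr(const uint32_t *data, int t)
{
    const unsigned char *d = (const unsigned char *)&data[t];

    return ((uint32_t)d[0] << 24) | ((uint32_t)d[1] << 16) |
           ((uint32_t)d[2] <<  8) | ((uint32_t)d[3] <<  0);
}

/*
 * Form 2: cast the base pointer once and fold t into the byte offset.
 * This mirrors the second SHA_SRC definition; with a constant t every
 * access is "base + constant".
 */
static inline uint32_t load_be32_via_byte_offset(const uint32_t *data, int t)
{
    const unsigned char *d = (const unsigned char *)data;

    return ((uint32_t)d[t * 4 + 0] << 24) | ((uint32_t)d[t * 4 + 1] << 16) |
           ((uint32_t)d[t * 4 + 2] <<  8) | ((uint32_t)d[t * 4 + 3] <<  0);
}
```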

Also, wrt skipping the last 3 write-backs to the 16-word array...  In 
all the (limited) attempts I've made so far to do that, it always ended 
up making things worse.  I've yet to investigate why, though.
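
For reference, here is a minimal sketch of why those 3 write-backs can
be skipped at all.  It is not code from either implementation, and the
function names are made up for the example.  In the FIPS 180-2 message
schedule, W[t] is only read again when computing W[t+3], W[t+8],
W[t+14] and W[t+16], so W[77], W[78] and W[79] are never read back.
The ring version below keeps only 16 words and skips the store for
those rounds; a real unrolled implementation would resolve the test on
t at compile time rather than at run time.

```c
#include <stdint.h>

static inline uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Reference: the full 80-word schedule, as written in FIPS 180-2. */
static void sha1_schedule_full(const uint32_t in[16], uint32_t w[80])
{
    int t;

    for (t = 0; t < 16; t++)
        w[t] = in[t];
    for (t = 16; t < 80; t++)
        w[t] = rol32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1);
}

/*
 * Same schedule computed in a 16-word ring.  out[] only records each
 * round's word so the result can be compared with the reference.
 */
static void sha1_schedule_ring(const uint32_t in[16], uint32_t out[80])
{
    uint32_t ring[16];
    int t;

    for (t = 0; t < 16; t++)
        out[t] = ring[t] = in[t];
    for (t = 16; t < 80; t++) {
        /* t-3, t-8, t-14, t-16 reduced modulo 16 */
        uint32_t v = rol32(ring[(t + 13) & 15] ^ ring[(t + 8) & 15] ^
                           ring[(t + 2) & 15] ^ ring[t & 15], 1);

        if (t < 77)             /* rounds 77..79: nobody reads these back */
            ring[t & 15] = v;
        out[t] = v;
    }
}
```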

Nicolas