Re: [PATCH 0/7] block-sha1: improved SHA1 hashing

Artur Skawina <art.08.09@xxxxxxxxx> · Fri, 07 Aug 2009 03:30:36 +0200

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Linus Torvalds wrote:
>> In particular, I'm thinking about the warning in the Intel optimization 
>> manual:
>>
>> 	The rotate by immediate and rotate by register instructions are 
>> 	more expensive than a shift. The rotate by 1 instruction has the 
>> 	same latency as a shift.
>>
>> so it's very possible that "rotate by 1" is much better than other 
>> rotates.
> 
> Hmm. Probably not. Googling more seems to indicate that rotates and shifts 
> have a fixed 4-cycle latency on Northwood. I'm not seeing anything that 
> indicates that a single-bit rotate/shift would be any faster.
> 
> (And remember, if 4 cycles doesn't sound so bad: that's enough of a 
> latency to do _16_ "simple" ALU's, since they can be double-pumped in the 
> two regular ALU's).

Looking at the generated code, a lot of the ro[rl] instructions have moved
around, so it's likely that this contributes to the problem.

I also see 44 extra lea instructions, 44 fewer adds, and changes like:
        [...]
        mov    XX(%eRX),%eRX
        xor    XX(%eRX),%eRX
-       and    %eRX,%eRX
+       and    XX(%eRX),%eRX
        xor    XX(%eRX),%eRX
-       add    %eRX,%eRX
-       ror    $0x2,%eRX
-       mov    %eRX,XX(%eRX)
+       lea    (%eRX,%eRX,1),%eRX
        mov    XX(%eRX),%eRX
        bswap  %eRX
        mov    %eRX,XX(%eRX)
        mov    %eRX,%eRX
+       ror    $0x2,%eRX
+       mov    %eRX,XX(%eRX)
+       mov    %eRX,%eRX
        rol    $0x5,%eRX
        mov    XX(%eRX),%eRX
-       mov    XX(%eRX),%eRX
        [...]
which could mean that gcc did a better job of register allocation
(where "better job" might be just luck).
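
For reference, here is a minimal C sketch of where the "rol $0x5" and
"ror $0x2" in the listing above come from. It is only an illustration of
a standard SHA-1 round (rounds 0..19), not the actual block-sha1 source;
the names rol32, ror32 and sha1_round_ch are made up for this example.
Note that a rotate left by 30 is the same as a rotate right by 2, which
is presumably why the compiler emits ror.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits; valid for 0 < n < 32. */
static inline uint32_t rol32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/* Rotate a 32-bit word right by n bits; valid for 0 < n < 32. */
static inline uint32_t ror32(uint32_t x, unsigned n)
{
	return (x >> n) | (x << (32 - n));
}

/*
 * One SHA-1 round for rounds 0..19 (hypothetical helper).
 * v[0..4] hold the working variables A, B, C, D, E; w is the
 * message word for this round.
 */
static inline void sha1_round_ch(uint32_t v[5], uint32_t w)
{
	/* "choose" function written as xor/and/xor */
	uint32_t f = v[3] ^ (v[1] & (v[2] ^ v[3]));
	/* rotate by 5, then a chain of adds */
	uint32_t t = rol32(v[0], 5) + f + v[4] + w + 0x5a827999u;

	v[4] = v[3];
	v[3] = v[2];
	v[2] = ror32(v[1], 2);	/* same as rol32(B, 30) */
	v[1] = v[0];
	v[0] = t;
}
```

Both rotates use an immediate count other than 1, so if they really are
slow on this CPU, every round pays for them twice.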

artur