Re: [PATCH 0/7] block-sha1: improved SHA1 hashing

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 6 Aug 2009 18:55:19 -0700 (PDT)

On Fri, 7 Aug 2009, Artur Skawina wrote:
> 
> I also see 44 extra lea instructions, 44 less adds

add and lea (as long as the lea shift is 1) should be the same on a P4 
(they are not the same on some other microarchitectures and lea can have 
address generation stalls etc).

Lea, of course, gives the potential for register movement at the same time 
(three-address op), and that's likely the reason for lea-vs-adds.
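
To make the three-address point concrete, here is a rough C sketch. The
function name is made up for illustration, and the asm in the comments is
only what a compiler might pick, not something gcc is guaranteed to emit:

```c
#include <stdint.h>

/*
 * Hypothetical example: both inputs stay live after the sum is formed,
 * so the sum needs a third register.
 *
 * With a two-address add that takes a copy first:
 *	mov %eax,%ecx
 *	add %edx,%ecx
 * With lea the copy and the add can fold into one instruction:
 *	lea (%eax,%edx),%ecx
 */
uint32_t sum_keep_inputs(uint32_t a, uint32_t b)
{
	uint32_t t = a + b;	/* a and b are both used again below */

	return (t ^ a) + b;
}
```

Whether the compiler really uses lea here depends on register pressure and
the target it is tuning for.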

> and changes like:
>         [...]
>         mov    XX(%eRX),%eRX
>         xor    XX(%eRX),%eRX
> -       and    %eRX,%eRX
> +       and    XX(%eRX),%eRX

Yeah, different spill patterns. That's the biggest issue, I think.

In particular, on P4, with unlucky spills, you may end up with things like

	ror $2,reg
	mov reg,x(%esp)
	.. a few instructions ..
	xor x(%esp), reg

and the above is exactly when one of the worst P4 problems hits: a store, 
followed a few cycles later by a load from the same address (and "a few 
cycles later" can be quite a few instructions if they are the nice ones).

What can happen is that if the store data isn't ready yet (because it 
comes from a long-latency op like a shift or a multiply), then you hit a 
store buffer replay thing. The P4 (with its long pipeline) basically 
starts the load speculatively, and if anything bad happens for the load 
(L1 cache miss, TLB miss, store buffer fault, you name it), it will cause 
a replay of the whole pipeline.

Which can take tens of cycles. 

[ That said, it's been a long time since I did a lot of P4 worrying. So I 
  may mis-remember the details. But that whole store buffer forwarding had 
  some really nasty replay issues ]
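
As a sketch of the shape of that pattern in C: the volatile slot below
stands in for a compiler spill, which is an assumption made purely for
illustration (a real spill is the register allocator's choice, not
something in the source). The function names and the added constant are
likewise just for the example:

```c
#include <stdint.h>

/* Rotate right by 2, like the "ror $2,reg" in the snippet above. */
uint32_t ror2(uint32_t x)
{
	return (x >> 2) | (x << 30);
}

/* The rotated value can stay in a register the whole time. */
uint32_t mix_in_register(uint32_t a, uint32_t b)
{
	uint32_t r = ror2(a);

	b += 0x5a827999;	/* some unrelated work in between */
	return b ^ r;
}

/*
 * Same arithmetic, but the rotated value is forced through a stack
 * slot: a store right after the rotate, then a load from the same
 * address shortly afterwards.
 */
uint32_t mix_via_spill(uint32_t a, uint32_t b)
{
	volatile uint32_t slot;

	slot = ror2(a);		/* like: mov reg,x(%esp) */
	b += 0x5a827999;	/* ".. a few instructions .." */
	return b ^ slot;	/* like: xor x(%esp),reg */
}
```

Both functions return the same value; only the memory round trip differs,
and that round trip is the part that can turn into a replay on a P4.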

> which could mean that gcc did a better job of register allocation
> (where "better job" might be just luck).

I suspect that's the biggest issue. Just _happening_ to get the spills so 
that they don't hurt. And with unlucky scheduling, you might hit some of 
the P4 replay issues every single time.

There are some P4 optimizations that are simple:
 - avoid complex instructions
 - don't blow the trace cache
 - predictable branches
but the replay faults can really get you.

			Linus