Re: [v5] MIPS: lib: csum_partial: more instruction paral

"Maciej W. Rozycki" <macro@xxxxxxxxxxxxxx> · Thu, 2 Apr 2015 13:59:27 +0100 (BST)

On Tue, 31 Mar 2015, cee1 wrote:

> >> One example about how this patch works is in CSUM_BIGCHUNK1:
> >> // ** original **    vs    ** patch applied **
> >>     ADDC(sum, t0)           ADDC(t0, t1)
> >>     ADDC(sum, t1)           ADDC(t2, t3)
> >>     ADDC(sum, t2)           ADDC(sum, t0)
> >>     ADDC(sum, t3)           ADDC(sum, t2)
> >>
> >> With this patch applied, ADDC and the **next next** ADDC are independent.
> >
> > This is interesting because even CPUs as old as the R2000 have a pipeline
> > bypass which allows an instruction to use a result written to a register
> > by an immediately preceding instruction.
> 
> But if it removes some dependencies (as the patch did), instruction A
> and instruction B can be issued in the same cycle[1], instead of B
> waiting for the result from A (a pipeline bypass reduces the wait time,
> but does not eliminate it, right?)

 Hmm, that sounds to me remarkably like the scenario with Intel's original 
Pentium processor that had a dual issue pipeline with U and V execution 
pipes, both of which accepted ALU operations, and then each had some 
further constraints as to other instructions, some of which had to go to a 
specific pipe of the two (and were still parallelised if the other 
instruction was acceptable for the other pipe).

 To get good performance out of that design you had to interleave ALU 
operations so that there was no data dependency between two consecutive 
instructions, in which case two instructions could have been issued and 
retired at a time, in parallel.  The further constraints the U and V pipes 
had with other instructions made instruction scheduling quite an 
interesting challenge for the compiler or handcoded assembly.
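
 As a rough illustration of the reordering quoted above, here is a minimal 
C sketch.  It assumes ADDC is a 32-bit one's-complement add with end-around 
carry; addc(), csum_serial() and csum_paired() are hypothetical names made 
up for this example and are not taken from csum_partial.S.

```c
#include <stdint.h>

/* Assumed model of ADDC: add, then fold the carry-out back in. */
static uint32_t addc(uint32_t sum, uint32_t v)
{
	sum += v;
	return sum + (sum < v);	/* sum < v only if the add wrapped */
}

/* Original order: every add depends on the one just before it. */
static uint32_t csum_serial(uint32_t sum, const uint32_t t[4])
{
	sum = addc(sum, t[0]);
	sum = addc(sum, t[1]);
	sum = addc(sum, t[2]);
	sum = addc(sum, t[3]);
	return sum;
}

/* Patched order: the first two adds do not depend on each other. */
static uint32_t csum_paired(uint32_t sum, const uint32_t t[4])
{
	uint32_t t0 = addc(t[0], t[1]);	/* independent of the next line */
	uint32_t t2 = addc(t[2], t[3]);

	sum = addc(sum, t0);
	sum = addc(sum, t2);
	return sum;
}
```

 Both functions should return the same value for any input, because this 
kind of addition is commutative and associative.  What changes is the 
longest dependency chain, which drops from four adds to three, and the 
first two adds become candidates for issue in the same cycle on a 
dual-issue pipeline.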

 With the more complex pipeline design that the Pentium's successor, the 
Pentium Pro, had, there was no longer such an issue.  I reckon there were 
several mechanisms involved, including register renaming and speculative 
execution of more than just two instructions ahead, that eliminated the 
need for such constrained instruction scheduling, although I don't 
remember offhand how all this worked.

> > Can you explain why this patch is so beneficial for Loongson 3A?
> 
> I have written a simple test[2] to measure the performance gain on
> Loongson 3A; the result[3] shows a performance gain of at most 50%.
> 
> IMHO, the patch not only benefits Loongson 3A, but would also benefit
> other MIPS CPU(s).

 I'm not sure if any other such superscalar MIPS pipeline implementation 
exists, but if written correctly then at worst it won't hurt anyone else, 
so just make sure your change does not regress scalar MIPS pipelines.  I 
hope you have a way to verify it.

  Maciej