On Tue, 31 Mar 2015, cee1 wrote:

> >> One example about how this patch works is in CSUM_BIGCHUNK1:
> >> // ** original **    vs    ** patch applied **
> >>     ADDC(sum, t0)           ADDC(t0, t1)
> >>     ADDC(sum, t1)           ADDC(t2, t3)
> >>     ADDC(sum, t2)           ADDC(sum, t0)
> >>     ADDC(sum, t3)           ADDC(sum, t2)
> >>
> >> With this patch applied, ADDC and the **next next** ADDC are independent.
> >
> >  This is interesting because even CPUs as old as the R2000 have a pipeline
> > bypass which allows an instruction to use a result written to a register
> > by an immediately preceding instruction.
>
> But if it removes some dependency (as the patch did), instruction A and
> instruction B can be issued in the same cycle[1], instead of B waiting
> for the result from A (a pipeline bypass reduces the wait time, but
> does not eliminate it, right?)

 Hmm, that sounds to me remarkably like the scenario with Intel's original
Pentium processor, which had a dual-issue pipeline with U and V execution
pipes.  Both pipes accepted ALU operations, and each had some further
constraints as to other instructions, some of which had to go to a specific
pipe of the two (and were still parallelised if the other instruction was
acceptable for the other pipe).

 To get good performance out of that design you had to interleave ALU
operations so that there was no data dependency between two consecutive
instructions, in which case two instructions could be issued and retired at
a time, in parallel.  The further constraints the U and V pipes had with
other instructions made instruction scheduling quite an interesting
challenge for the compiler or for handcoded assembly.
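 For illustration, here is a minimal C sketch of the reordering quoted
above.  It assumes ADDC is a 32-bit add with end-around carry; the names
addc, csum4_serial, csum4_paired and csum4_checked are hypothetical and not
taken from the patch.  In the serial form every step needs the previous sum.
In the paired form the first two steps share no operands, so a dual-issue
pipeline could in principle start them together.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit add with end-around carry (assumed ADDC semantics). */
static inline uint32_t addc(uint32_t sum, uint32_t reg)
{
    sum += reg;
    return sum + (sum < reg);   /* fold the carry-out back in */
}

/* Original order: each addc depends on the one right before it. */
static uint32_t csum4_serial(uint32_t sum, const uint32_t w[4])
{
    sum = addc(sum, w[0]);
    sum = addc(sum, w[1]);
    sum = addc(sum, w[2]);
    sum = addc(sum, w[3]);
    return sum;
}

/* Patched order: the first two addc calls are independent of each other. */
static uint32_t csum4_paired(uint32_t sum, const uint32_t w[4])
{
    uint32_t t0 = addc(w[0], w[1]);
    uint32_t t2 = addc(w[2], w[3]);

    sum = addc(sum, t0);
    sum = addc(sum, t2);
    return sum;
}

/* Hypothetical helper: compute both ways and check that they agree. */
static uint32_t csum4_checked(uint32_t sum, const uint32_t w[4])
{
    uint32_t a = csum4_serial(sum, w);
    uint32_t b = csum4_paired(sum, w);

    assert(a == b);
    return a;
}
```

 Both orders return the same value because end-around-carry addition is
associative and commutative.  Whether the paired form is actually faster
depends on the pipeline, and a C compiler is free to reschedule either
version anyway, so this only shows the dependency structure.
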
 With the more complex pipeline design of the Pentium's successor, the
Pentium Pro, there was no longer such an issue.  I reckon several mechanisms
were involved, including register renaming and speculative execution of more
than just two instructions ahead, that eliminated the need for such
constrained instruction scheduling, although I don't remember offhand how all
this worked.

> > Can you explain why this patch is so beneficial for Loongson 3A?
>
> I have written a simple test[2] to measure the performance gain on
> Loongson 3A, the result[3] shows at most 50% performance gain.
>
> IMHO, the patch not only benefits Loongson 3A, but would also benefit
> other MIPS CPU(s).

 I'm not sure if any other such superscalar MIPS pipeline implementation
exists, but if written correctly then at worst it won't hurt anyone else, so
just make sure your change does not regress scalar MIPS pipelines.  I hope
you have a way to verify it.

  Maciej