2015-03-31 4:10 GMT+08:00 Ralf Baechle <ralf@xxxxxxxxxxxxxx>:
>> One example about how this patch works is in CSUM_BIGCHUNK1:
>> // ** original **    vs    ** patch applied **
>>     ADDC(sum, t0)            ADDC(t0, t1)
>>     ADDC(sum, t1)            ADDC(t2, t3)
>>     ADDC(sum, t2)            ADDC(sum, t0)
>>     ADDC(sum, t3)            ADDC(sum, t2)
>>
>> With this patch applied, ADDC and the **next next** ADDC are independent.
>
> This is interesting because even CPUs as old as the R2000 have a pipeline
> bypass which allows an instruction to use a result written to a register
> by an immediately preceding instruction.

But if a dependency is removed (as the patch does), instruction A and
instruction B can be issued in the same cycle [1], instead of B waiting for
the result from A (a pipeline bypass reduces the wait time, but does not
eliminate it, right?).

> Can you explain why this patch is so beneficial for Loongson 3A?

I have written a simple test [2] to measure the performance gain on
Loongson 3A; the result [3] shows a performance gain of up to 50%.

IMHO, the patch not only benefits Loongson 3A, but would also benefit other
MIPS CPUs.

--
1. If the hardware supports this, e.g. it has at least two ALUs and an
   out-of-order execution pipeline, etc.
2. http://dev.lemote.com/files/upload/software/csum-opti/csum-test.tar.gz
3. http://dev.lemote.com/files/upload/software/csum-opti/csum-opti-benchmark.html

Regards,
- cee1
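
For illustration, here is a minimal C sketch of the two ADDC orderings quoted above. It is not the kernel code, which is MIPS assembly. addc() is an assumed C model of the ADDC macro (add, then add the carry back in), and csum_chain(), csum_tree() and check_orders_agree() are hypothetical names used only here. The sketch shows that the reordered version gives the same result while making the first two additions independent of sum.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of ADDC(sum, reg): sum += reg, then add the carry back in. */
static inline uint64_t addc(uint64_t sum, uint64_t reg)
{
    sum += reg;
    return sum + (sum < reg);   /* (sum < reg) is 1 if the add wrapped */
}

/* "original" ordering: every addc() depends on the previous one via sum. */
static inline uint64_t csum_chain(uint64_t sum, const uint64_t w[4])
{
    sum = addc(sum, w[0]);
    sum = addc(sum, w[1]);
    sum = addc(sum, w[2]);
    sum = addc(sum, w[3]);
    return sum;
}

/* "patch applied" ordering: the first two addc() calls do not touch sum
 * or each other, so a CPU with two ALUs could execute them in parallel. */
static inline uint64_t csum_tree(uint64_t sum, const uint64_t w[4])
{
    uint64_t t0 = addc(w[0], w[1]);
    uint64_t t2 = addc(w[2], w[3]);
    sum = addc(sum, t0);
    sum = addc(sum, t2);
    return sum;
}

/* Hypothetical self-check: both orderings must produce the same sum. */
static inline void check_orders_agree(uint64_t sum, const uint64_t w[4])
{
    assert(csum_chain(sum, w) == csum_tree(sum, w));
}
```

Both orderings agree because end-around-carry addition is associative and commutative. Whether the second ordering is actually faster depends on the hardware conditions in footnote 1, and a C compiler is free to reschedule either version, so this sketch only demonstrates the dependency structure, not the speedup.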