On Fri, Mar 27, 2015 at 01:07:24AM +0800, cee1 wrote:

> From: Chen Jie <chenj@xxxxxxxxxx>
>
> Computing sum introduces true data dependency. This patch removes some
> true data dependencies, hence increases instruction level parallelism.
>
> This patch brings at most 50% csum performance gain on Loongson 3a
> processor in our test.
>
> One example about how this patch works is in CSUM_BIGCHUNK1:
> // ** original **       vs    ** patch applied **
>     ADDC(sum, t0)             ADDC(t0, t1)
>     ADDC(sum, t1)             ADDC(t2, t3)
>     ADDC(sum, t2)             ADDC(sum, t0)
>     ADDC(sum, t3)             ADDC(sum, t2)
>
> In the original implementation, each ADDC(sum, ...) depends on the sum
> value updated by previous ADDC (as source operand).
>
> With this patch applied, the first two ADDC operations are independent,
> hence can be executed simultaneously if possible.
>
> Another example is in the "copy and sum calculating chunk":
> // ** original **       vs    ** patch applied **
>     STORE(t0, UNIT(0) ...     STORE(t0, UNIT(0) ...
>     ADDC(sum, t0)             ADDC(t0, t1)
>     STORE(t1, UNIT(1) ...     STORE(t1, UNIT(1) ...
>     ADDC(sum, t1)             ADDC(sum, t0)
>     STORE(t2, UNIT(2) ...     STORE(t2, UNIT(2) ...
>     ADDC(sum, t2)             ADDC(t2, t3)
>     STORE(t3, UNIT(3) ...     STORE(t3, UNIT(3) ...
>     ADDC(sum, t3)             ADDC(sum, t2)
>
> With this patch applied, ADDC and the **next next** ADDC are independent.

This is interesting because even CPUs as old as the R2000 have a pipeline
bypass which allows an instruction to use a result written to a register
by an immediately preceding instruction.

Can you explain why this patch is so beneficial for Loongson 3A?

Thanks,

  Ralf
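
For reference, the reordering quoted above can be modelled in C. This is only a sketch, not the kernel code: addc() assumes ADDC behaves as an add followed by adding the carry back in (end-around carry) on 64-bit registers, and the names addc, sum_serial and sum_paired are made up for this illustration. It shows that the two orderings produce the same sum while the paired form starts with two additions that do not depend on each other.

```c
#include <stdint.h>

/* Assumed model of ADDC(sum, reg): add, then fold the carry back in. */
static inline uint64_t addc(uint64_t sum, uint64_t reg)
{
    sum += reg;
    return sum + (sum < reg);   /* (sum < reg) is 1 exactly when the add wrapped */
}

/* Original ordering: every step needs the sum from the step before it. */
static uint64_t sum_serial(uint64_t sum, const uint64_t t[4])
{
    sum = addc(sum, t[0]);
    sum = addc(sum, t[1]);
    sum = addc(sum, t[2]);
    sum = addc(sum, t[3]);
    return sum;
}

/* Patched ordering: the first two adds are independent of each other and
 * of sum, so a multiple-issue core may execute them in the same cycle.
 * The assembly overwrites t0 and t2 in place; locals are used here. */
static uint64_t sum_paired(uint64_t sum, const uint64_t t[4])
{
    uint64_t t01 = addc(t[0], t[1]);
    uint64_t t23 = addc(t[2], t[3]);
    sum = addc(sum, t01);
    sum = addc(sum, t23);
    return sum;
}
```

Under this model the longest chain of dependent addc steps drops from four to three per group of four words. That shortens the chain, but it does not by itself account for the reported gain, which is what the question above asks about.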