On 05/20/2014 01:09 PM, chenj wrote:
> Computing sum introduces true data dependency. This patch removes some
> true data dependencies, hence instruction level parallelism is
> improved.
>
> This patch brings at most 50% csum performance gain on Loongson 3a
> processor in our test.
>
> One example about how this patch works is in CSUM_BIGCHUNK1:
> // ** original **       vs      ** patch applied **
>     ADDC(sum, t0)               ADDC(t0, t1)
>     ADDC(sum, t1)               ADDC(t2, t3)
>     ADDC(sum, t2)               ADDC(sum, t0)
>     ADDC(sum, t3)               ADDC(sum, t2)
>
> In the original implementation, each ADDC(sum, ...) references the sum
> value updated by previous ADDC.
>
> With patch applied, the first two ADDC operations are independent,
> hence can be executed simultaneously if possible.
>
> Another example is in the "copy and sum calculating" chunk:
> // ** original **       vs      ** patch applied **
>     STORE(t0, UNIT(0)...        STORE(t0, UNIT(0)...
>     ADDC(sum, t0)               ADDC(t0, t1)
>     STORE(t1, UNIT(1)...        STORE(t1, UNIT(1)...
>     ADDC(sum, t1)               ADDC(sum, t0)
>     STORE(t2, UNIT(2)...        STORE(t2, UNIT(2)...
>     ADDC(sum, t2)               ADDC(t2, t3)
>     STORE(t3, UNIT(3)...        STORE(t3, UNIT(3)...
>     ADDC(sum, t3)               ADDC(sum, t2)
>
> With patch applied, the second and third ADDC are independent.

Hi chenj,

You forgot to sign off your patch.

-- 
markos
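
For anyone following the thread, the reordering in the CSUM_BIGCHUNK1
example can be sketched in C. This is an illustration only, not the
kernel code: addc() is a hypothetical stand-in for the ADDC assembler
macro, assumed here to be a 64-bit add with end-around carry, and
csum_chain() / csum_paired() are made-up names for the "original" and
"patch applied" orderings.

```c
#include <stdint.h>

/* Hypothetical C stand-in for ADDC: add reg into sum, then fold the
 * carry-out back into the low bit (end-around carry). */
static inline uint64_t addc(uint64_t sum, uint64_t reg)
{
    sum += reg;
    return sum + (sum < reg);   /* (sum < reg) is 1 iff the add wrapped */
}

/* "original" ordering: every step needs the sum from the previous step,
 * so the four additions form one serial dependency chain. */
uint64_t csum_chain(uint64_t sum, const uint64_t t[4])
{
    sum = addc(sum, t[0]);
    sum = addc(sum, t[1]);
    sum = addc(sum, t[2]);
    sum = addc(sum, t[3]);
    return sum;
}

/* "patch applied" ordering: the first two additions touch neither sum
 * nor each other, so a superscalar core may issue them together. */
uint64_t csum_paired(uint64_t sum, const uint64_t t[4])
{
    uint64_t a = addc(t[0], t[1]);   /* independent of b and sum */
    uint64_t b = addc(t[2], t[3]);   /* independent of a and sum */
    sum = addc(sum, a);
    sum = addc(sum, b);
    return sum;
}
```

Both functions return the same value for the same inputs, because
end-around-carry addition is associative and commutative. The paired
version only shortens the chain of additions that must wait on sum
from four to two.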