2014-05-19 14:59 GMT+08:00 James Hogan <james.hogan@xxxxxxxxxx>: > On Monday 19 May 2014 11:14:07 chenj wrote: >> Computing sum introduces true data dependency, e.g. >> ADDC(sum, t0) >> ADDC(sum, t1) >> ADDC(sum, t2) >> ADDC(sum, t3) >> Here, each ADDC(sum, ...) references the sum value updated by previous ADDC. >> >> In this patch, above sequence is adjusted as following: >> ADDC(t0, t1) >> ADDC(t2, t3) >> ADDC(sum, t0) >> ADDC(sum, t2) >> The first two ADDC operations are independent, hence can be executed >> simultaneously if possible. > > The actual patch appears to change it to this: > ADDC(t0, t1) > ADDC(sum, t0) > ADDC(t2, t3) > ADDC(sum, t2) > > which is slightly different (presumably due to the interleaved stores in some > of the cases). > >> This patch improves instruction level parallelism, and brings at most 50% >> csum performance gain on Loongson 3a processor[1]. > > Nice results. > > The stuff below the --- will get dropped when the patch is applied though, > after which the "[1]" won't refer to anything. > Thanks for your suggestion, I'll amend the commit message further later. Basically, the patch reduces the case of one ADDC depending on the result of the previous ADDC. BTW, I'm not sure whether the sum value of the new implementation is equivalent to the original one, but in my test(make run_test of the csum_test.tar.gz, and a comparing patch in kernel) it is.