On Monday 19 May 2014 11:14:07 chenj wrote:
> Computing sum introduces true data dependency, e.g.
> 	ADDC(sum, t0)
> 	ADDC(sum, t1)
> 	ADDC(sum, t2)
> 	ADDC(sum, t3)
> Here, each ADDC(sum, ...) references the sum value updated by previous ADDC.
>
> In this patch, above sequence is adjusted as following:
> 	ADDC(t0, t1)
> 	ADDC(t2, t3)
> 	ADDC(sum, t0)
> 	ADDC(sum, t2)
> The first two ADDC operations are independent, hence can be executed
> simultaneously if possible.

The actual patch appears to change it to this:
	ADDC(t0, t1)
	ADDC(sum, t0)
	ADDC(t2, t3)
	ADDC(sum, t2)
which is slightly different (presumably due to the interleaved stores in some
of the cases).

> This patch improves instruction level parallelism, and brings at most 50%
> csum performance gain on Loongson 3a processor[1].

Nice results. The stuff below the --- will get dropped when the patch is
applied though, after which the "[1]" won't refer to anything.

Cheers
James

>
> ---
> 1. The result can be found at
> http://dev.lemote.com/files/upload/software/csum-opti/csum-opti-benchmark.html
> And is generated by a userspace test program:
> http://dev.lemote.com/files/upload/software/csum-opti/csum-test.tar.gz
>
> [v2: amend commit message]
>
>  arch/mips/lib/csum_partial.S | 38 +++++++++++++++++++-------------------
>  1 file changed, 19 insertions(+), 19 deletions(-)
>
> diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
> index 9901237..6cea101 100644
> --- a/arch/mips/lib/csum_partial.S
> +++ b/arch/mips/lib/csum_partial.S
> @@ -76,10 +76,10 @@
>  	LOAD _t1, (offset + UNIT(1))(src); \
>  	LOAD _t2, (offset + UNIT(2))(src); \
>  	LOAD _t3, (offset + UNIT(3))(src); \
> +	ADDC(_t0, _t1); \
> +	ADDC(_t2, _t3); \
>  	ADDC(sum, _t0); \
> -	ADDC(sum, _t1); \
> -	ADDC(sum, _t2); \
> -	ADDC(sum, _t3)
> +	ADDC(sum, _t2)
>
>  #ifdef USE_DOUBLE
>  #define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3) \
> @@ -501,21 +501,21 @@ LEAF(csum_partial)
>  	SUB len, len, 8*NBYTES
>  	ADD src, src, 8*NBYTES
>  	STORE(t0, UNIT(0)(dst), .Ls_exc\@)
> -	ADDC(sum, t0)
> +	ADDC(t0, t1)
>  	STORE(t1, UNIT(1)(dst), .Ls_exc\@)
> -	ADDC(sum, t1)
> +	ADDC(sum, t0)
>  	STORE(t2, UNIT(2)(dst), .Ls_exc\@)
> -	ADDC(sum, t2)
> +	ADDC(t2, t3)
>  	STORE(t3, UNIT(3)(dst), .Ls_exc\@)
> -	ADDC(sum, t3)
> +	ADDC(sum, t2)
>  	STORE(t4, UNIT(4)(dst), .Ls_exc\@)
> -	ADDC(sum, t4)
> +	ADDC(t4, t5)
>  	STORE(t5, UNIT(5)(dst), .Ls_exc\@)
> -	ADDC(sum, t5)
> +	ADDC(sum, t4)
>  	STORE(t6, UNIT(6)(dst), .Ls_exc\@)
> -	ADDC(sum, t6)
> +	ADDC(t6, t7)
>  	STORE(t7, UNIT(7)(dst), .Ls_exc\@)
> -	ADDC(sum, t7)
> +	ADDC(sum, t6)
>  	.set reorder /* DADDI_WAR */
>  	ADD dst, dst, 8*NBYTES
>  	bgez len, 1b
> @@ -541,13 +541,13 @@ LEAF(csum_partial)
>  	SUB len, len, 4*NBYTES
>  	ADD src, src, 4*NBYTES
>  	STORE(t0, UNIT(0)(dst), .Ls_exc\@)
> -	ADDC(sum, t0)
> +	ADDC(t0, t1)
>  	STORE(t1, UNIT(1)(dst), .Ls_exc\@)
> -	ADDC(sum, t1)
> +	ADDC(sum, t0)
>  	STORE(t2, UNIT(2)(dst), .Ls_exc\@)
> -	ADDC(sum, t2)
> +	ADDC(t2, t3)
>  	STORE(t3, UNIT(3)(dst), .Ls_exc\@)
> -	ADDC(sum, t3)
> +	ADDC(sum, t2)
>  	.set reorder /* DADDI_WAR */
>  	ADD dst, dst, 4*NBYTES
>  	beqz len, .Ldone\@
> @@ -646,13 +646,13 @@ LEAF(csum_partial)
>  	nop # improves slotting
>  #endif
>  	STORE(t0, UNIT(0)(dst), .Ls_exc\@)
> -	ADDC(sum, t0)
> +	ADDC(t0, t1)
>  	STORE(t1, UNIT(1)(dst), .Ls_exc\@)
> -	ADDC(sum, t1)
> +	ADDC(sum, t0)
>  	STORE(t2, UNIT(2)(dst), .Ls_exc\@)
> -	ADDC(sum, t2)
> +	ADDC(t2, t3)
>  	STORE(t3, UNIT(3)(dst), .Ls_exc\@)
> -	ADDC(sum, t3)
> +	ADDC(sum, t2)
>  	.set reorder /* DADDI_WAR */
>  	ADD dst, dst, 4*NBYTES
>  	bne len, rem, 1b
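
For anyone wanting to check that reordering the ADDC operations cannot change the checksum, here is a minimal C sketch. It assumes ADDC(sum, reg) behaves as a 32-bit add with end-around carry (add, then add the carry-out back in), which is the usual building block for Internet checksums. The names addc, csum_serial, csum_paired and check_orders_agree are made up for this illustration and do not appear in csum_partial.S. Since end-around-carry addition is associative and commutative, both orderings should give the same sum; only the length of the dependency chain differs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of ADDC(sum, reg): add, then fold the carry back in.
 * Assumption: 32-bit operands; the real macro works on register-width units. */
uint32_t addc(uint32_t sum, uint32_t reg)
{
    sum += reg;
    sum += (sum < reg);   /* 1 if the add wrapped, 0 otherwise */
    return sum;
}

/* Ordering before the patch: each addc needs the previous sum. */
uint32_t csum_serial(uint32_t sum, const uint32_t t[4])
{
    sum = addc(sum, t[0]);
    sum = addc(sum, t[1]);
    sum = addc(sum, t[2]);
    sum = addc(sum, t[3]);
    return sum;
}

/* Ordering from the commit message: the first two addc calls do not
 * depend on sum or on each other, so they can overlap in the pipeline. */
uint32_t csum_paired(uint32_t sum, const uint32_t t[4])
{
    uint32_t a = addc(t[0], t[1]);
    uint32_t b = addc(t[2], t[3]);
    sum = addc(sum, a);
    sum = addc(sum, b);
    return sum;
}

/* Compare both orderings on a few inputs, including ones that carry. */
void check_orders_agree(void)
{
    const uint32_t samples[3][4] = {
        { 1u, 2u, 3u, 4u },
        { 0xFFFFFFFFu, 0xFFFFFFFFu, 0x80000000u, 0x80000001u },
        { 0u, 0xFFFFFFFFu, 1u, 0xFFFFFFFEu },
    };
    const uint32_t seeds[3] = { 0u, 0x12345678u, 0xFFFFFFFFu };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            assert(csum_serial(seeds[j], samples[i]) ==
                   csum_paired(seeds[j], samples[i]));
}
```

The interleaved order in the store hunks (ADDC(t0, t1), ADDC(sum, t0), ADDC(t2, t3), ADDC(sum, t2)) has the same data flow as csum_paired, so it computes the same value.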