* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> 1)
>
> Note how R12 is used immediately, right in the next instruction:
>
>         vpaddq  (TBL), Y_0, XFER
>
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new
> constraint and a few extra cycles per loop iteration if the workload is
> address-generator bandwidth limited on that.
>
> A simple way to ease that constraint would be to move the 'TBL' load up into the
> loop body, to the point where 'T1' is used for the last time - which is:
>
>
>         mov     a, T1           # T1 = a                                 # MAJB
>         and     c, T1           # T1 = a&c                               # MAJB
>
>         add     y0, y2          # y2 = S1 + CH                           # --
>         or      T1, y3          # y3 = MAJ = ((a|c)&b)|(a&c)             # MAJ
>
> +       mov     frame_TBL(%rsp), TBL
>
>         add     y1, h           # h = k + w + h + S0                     # --
>
>         add     y2, d           # d = k + w + h + d + S1 + CH = d + t1   # --
>
>         add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0 # --
>         add     y3, h           # h = t1 + S0 + MAJ                      # --
>
> Note how this moves up the 'TBL' reload by 4 instructions.

Note that in this case 'TBL' would have to be initialized before the 1st
iteration, via something like:

        movq    $4, frame_SRND(%rsp)

+       mov     frame_TBL(%rsp), TBL

.align 16
loop1:
        vpaddq  (TBL), Y_0, XFER
        vmovdqa XFER, frame_XFER(%rsp)
        FOUR_ROUNDS_AND_SCHED

Thanks,

        Ingo