Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

Ingo Molnar <mingo@xxxxxxxxxx> · Thu, 14 Sep 2017 11:28:57 +0200

* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> 1)
> 
> Note how R12 is used immediately, right in the next instruction:
> 
>         vpaddq  (TBL), Y_0, XFER
> 
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new 
> constraint and a few extra cycles per loop iteration if the workload is 
> address-generator bandwidth limited on that.
> 
> A simple way to ease that constraint would be to move the 'TLB' load up into the 
> loop, body, to the point where 'T1' is used for the last time - which is:
> 
> 
>         mov     a, T1           # T1 = a                                # MAJB
>         and     c, T1           # T1 = a&c                              # MAJB
> 
>         add     y0, y2          # y2 = S1 + CH                          # --
>         or      T1, y3          # y3 = MAJ = (a|c)&b)|(a&c)             # MAJ
> 
> +       mov frame_TBL(%rsp), TBL
> 
>         add     y1, h           # h = k + w + h + S0                    # --
> 
>         add     y2, d           # d = k + w + h + d + S1 + CH = d + t1  # --
> 
>         add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0# --
>         add     y3, h           # h = t1 + S0 + MAJ                     # --
> 
> Note how this moves up the 'TLB' reload by 4 instructions.

Note that in this case 'TBL' would have to be initialized before the 1st 
iteration, via something like:

        movq    $4, frame_SRND(%rsp)

+	mov frame_TBL(%rsp), TBL

.align 16
loop1:
        vpaddq  (TBL), Y_0, XFER
        vmovdqa XFER, frame_XFER(%rsp)
        FOUR_ROUNDS_AND_SCHED

Thanks,

	Ingo