Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jul 25, 2018 at 10:11:42AM +0100, Ard Biesheuvel wrote:
> On 25 July 2018 at 11:05, Dave Martin <Dave.Martin@xxxxxxx> wrote:
> > On Tue, Jul 24, 2018 at 06:12:21PM +0100, Ard Biesheuvel wrote:
> >> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
> >> iteration of the GHASH and AES-GCM core routines is having a considerable
> >> performance impact on cores such as the Cortex-A53 with Crypto Extensions
> >> implemented.
> >>
> >> GHASH performance is down by 22% for large block sizes, and AES-GCM is
> >> down by 16% for large block sizes and 128 bit keys. This appears to be
> >> a result of the high performance of the crypto instructions on the one
> >> hand (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with
> >> the relatively poor load/store performance of this simple core.
> >>
> >> So let's reduce this performance impact by only doing the yield check
> >> once every 32 blocks for GHASH (or 4 when using the version based on
> >> 8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
> >> This way, we recover most of the performance while still limiting the
> >> duration of scheduling blackouts due to disabling preemption to ~1000
> >> cycles.
> >>
> >> Cc: Vakul Garg <vakul.garg@xxxxxxx>
> >> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
> >> ---
> >>  arch/arm64/crypto/ghash-ce-core.S | 12 +++++++++---
> >>  1 file changed, 9 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
> >> index dcffb9e77589..9c14beaabeee 100644
> >> --- a/arch/arm64/crypto/ghash-ce-core.S
> >> +++ b/arch/arm64/crypto/ghash-ce-core.S
> >> @@ -212,7 +212,7 @@
> >>       ushr            XL.2d, XL.2d, #1
> >>       .endm
> >>
> >> -     .macro          __pmull_ghash, pn
> >> +     .macro          __pmull_ghash, pn, yield_count
> >>       frame_push      5
> >>
> >>       mov             x19, x0
> >> @@ -259,6 +259,9 @@ CPU_LE(   rev64           T1.16b, T1.16b  )
> >>       eor             T2.16b, T2.16b, XH.16b
> >>       eor             XL.16b, XL.16b, T2.16b
> >>
> >> +     tst             w19, #(\yield_count - 1)
> >
> > This should perhaps be (\yield_count) - 1.
> >
> > It would be a bit silly to pass a non-trivial expression for yield_count
> > though.
> >
> >> +     b.ne            1b
> >> +
> >
> > Is it worth having a build-time check that \yield_count is a power of two?
> > (i.e., (\yield_count) & ((\yield_count) - 1) != 0).  We could have a
> > generic macro for that.
> >
> > Otherwise this code may poll more often than expected.  Not the end of
> > the world, though.
> >
> 
> Thanks for taking a look.
> 
> Given that the macro in question is used in exactly two places in the
> same file, and is unlikely to be reused unless the architecture gains
> support for another optional instruction that can be used as a drop-in
> replacement, I don't think any of this truly matters tbh.

Fair enough.  A comment on the macro definition might help, but beyond
that it is probably overkill.

Cheers
---Dave



[Index of Archives]     [Kernel]     [Gnu Classpath]     [Gnu Crypto]     [DM Crypt]     [Netfilter]     [Bugtraq]

  Powered by Linux