Re: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

On 25 July 2018 at 09:27, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
> (+ Mark)
>
> On 25 July 2018 at 08:57, Vakul Garg <vakul.garg@xxxxxxx> wrote:
>>
>>
>>> -----Original Message-----
>>> From: Ard Biesheuvel [mailto:ard.biesheuvel@xxxxxxxxxx]
>>> Sent: Tuesday, July 24, 2018 10:42 PM
>>> To: linux-crypto@xxxxxxxxxxxxxxx
>>> Cc: herbert@xxxxxxxxxxxxxxxxxxx; will.deacon@xxxxxxx;
>>> dave.martin@xxxxxxx; Vakul Garg <vakul.garg@xxxxxxx>;
>>> bigeasy@xxxxxxxxxxxxx; Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
>>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>>> NEON yield checks
>>>
>>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>>> iteration of the GHASH and AES-GCM core routines is having a considerable
>>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>>> implemented.
>>>
>>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>>> down by 16% for large block sizes and 128-bit keys. This appears to be a
>>> result of the high performance of the crypto instructions (2.0 cycles
>>> per byte for GHASH, 3.0 cpb for AES-GCM) on the one hand, combined with
>>> the relatively poor load/store performance of this simple core on the
>>> other.
>>>
>>> So let's reduce this performance impact by only doing the yield check once
>>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>>> This way, we recover most of the performance while still limiting the
>>> duration of scheduling blackouts due to disabling preemption to ~1000
>>> cycles.
>>
>> I tested this patch. It helped, but it did not restore performance to the previous level.
>> Are there more files remaining to be fixed? (Your original patch series adding the
>> preemptibility checks touched many more files than the four in this series.)
>>
>> Instead of hardcoding the 32-block/16-block limits, should they be controlled through Kconfig?
>> I suspect that different cores may require different values.
>>
>
> Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
> my 24xA53 system, probably because each per-CPU variable access
> involves disabling and re-enabling preemption, turning every per-CPU
> load into 2 loads and a store,

Actually, more like

load/store
load
load/store

so 3 loads and 2 stores.
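
Concretely, the generic this_cpu_read() fallback with CONFIG_PREEMPT
enabled boils down to something like the sketch below (the variable
name is illustrative; the real accessors are generated macros):

  preempt_disable();             /* preempt_count++: load + store     */
  val = raw_cpu_read(my_var);    /* one load; on arm64 the per-CPU
                                    offset lives in a system register */
  preempt_enable();              /* preempt_count--: load + store,
                                    plus a need_resched check         */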

> which hurts on this particular core.
> Mark and I have played around a bit with using a GPR to record the
> per-CPU offset, which would make this unnecessary, but this has its
> own set of problems so that is not expected to land any time soon.
>
> So if you care that much about squeezing the last drop of throughput
> out of your system, without regard for worst-case scheduling latency,
> disabling CONFIG_PREEMPT is a much better idea than playing around
> with tunables to tweak the maximum quantum of work that is executed
> with preemption disabled, especially since distro kernels will pick
> the default anyway.
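
For reference, the granularity the patch implements in the assembly
(via the yield macros) is equivalent to the following C-level sketch.
This is illustrative only: process_blocks() is a made-up name, and the
actual patch keeps the loop and the TIF_NEED_RESCHED check inside the
asm routine rather than splitting the input in C.

  static void ghash_update_sketch(u64 dg[2], const u8 *src, int blocks,
                                  struct ghash_key *key)
  {
          while (blocks > 0) {
                  int chunk = min(blocks, 32);   /* 32-block stride  */

                  kernel_neon_begin();           /* preemption off   */
                  process_blocks(dg, key, src, chunk);
                  kernel_neon_end();             /* preemption point */

                  src    += chunk * GHASH_BLOCK_SIZE;
                  blocks -= chunk;
          }
  }

At 2.0 cycles per byte, 32 blocks of 16 bytes is roughly 1024 cycles,
which is where the ~1000 cycle figure quoted above comes from.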


