On 25 July 2018 at 09:27, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
> (+ Mark)
>
> On 25 July 2018 at 08:57, Vakul Garg <vakul.garg@xxxxxxx> wrote:
>>
>>
>>> -----Original Message-----
>>> From: Ard Biesheuvel [mailto:ard.biesheuvel@xxxxxxxxxx]
>>> Sent: Tuesday, July 24, 2018 10:42 PM
>>> To: linux-crypto@xxxxxxxxxxxxxxx
>>> Cc: herbert@xxxxxxxxxxxxxxxxxxx; will.deacon@xxxxxxx;
>>> dave.martin@xxxxxxx; Vakul Garg <vakul.garg@xxxxxxx>;
>>> bigeasy@xxxxxxxxxxxxx; Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
>>> Subject: [PATCH 1/4] crypto/arm64: ghash - reduce performance impact of
>>> NEON yield checks
>>>
>>> As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
>>> iteration of the GHASH and AES-GCM core routines is having a considerable
>>> performance impact on cores such as the Cortex-A53 with Crypto Extensions
>>> implemented.
>>>
>>> GHASH performance is down by 22% for large block sizes, and AES-GCM is
>>> down by 16% for large block sizes and 128 bit keys. This appears to be a
>>> result of the high performance of the crypto instructions on the one hand
>>> (2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
>>> relatively poor load/store performance of this simple core.
>>>
>>> So let's reduce this performance impact by only doing the yield check once
>>> every 32 blocks for GHASH (or 4 when using the version based on 8-bit
>>> polynomial multiplication), and once every 16 blocks for AES-GCM.
>>> This way, we recover most of the performance while still limiting the
>>> duration of scheduling blackouts due to disabling preemption to ~1000
>>> cycles.
>>
>> I tested this patch. It helped, but it didn't regain the performance to the
>> previous level. Are there more files remaining to be fixed? (In your
>> original patch series for adding the preemptibility check, there were a lot
>> more files changed than in this series with 4 files.)
>>
>> Instead of using a hardcoded 32-block/16-block limit, should it be
>> controlled using Kconfig? I believe that on different cores, these values
>> could be required to be different.
>>
>
> Simply enabling CONFIG_PREEMPT already causes an 8% performance hit on
> my 24xA53 system, probably because each per-CPU variable access
> involves disabling and re-enabling preemption, turning every per-CPU
> load into 2 loads and a store,

Actually, it is more like load/store, load, load/store, so 3 loads and 2
stores.

> which hurts on this particular core.
> Mark and I have played around a bit with using a GPR to record the
> per-CPU offset, which would make this unnecessary, but this has its
> own set of problems, so that is not expected to land any time soon.
>
> So if you care that much about squeezing the last drop of throughput
> out of your system without regard for worst-case scheduling latency,
> disabling CONFIG_PREEMPT is a much better idea than playing around
> with tunables to tweak the maximum quantum of work that is executed
> with preemption disabled, especially since distro kernels will pick
> the default anyway.
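
For context, the ~1000 cycle figure in the patch description follows from
the quoted numbers: 32 GHASH blocks x 16 bytes x 2.0 cycles per byte is
roughly 1024 cycles (and 16 AES-GCM blocks x 16 bytes x 3.0 cpb is roughly
768 cycles). The pattern under discussion looks roughly like the C sketch
below. This is a minimal illustration only, not the actual kernel code (the
real yield checks are implemented in the arm64 assembly core routines);
struct ghash_ctx and ghash_do_blocks() are hypothetical stand-ins for the
real context type and NEON block-processing helper.

```c
#include <asm/neon.h>        /* kernel_neon_begin()/kernel_neon_end() */
#include <linux/kernel.h>    /* min()                                 */

#define GHASH_BLOCK_SIZE   16
#define YIELD_BLOCK_LIMIT  32   /* yield check once per 32 blocks     */

struct ghash_ctx;                            /* hypothetical context  */
void ghash_do_blocks(struct ghash_ctx *ctx,  /* hypothetical helper   */
                     const u8 *src, int nblocks);

static void ghash_update_sketch(struct ghash_ctx *ctx, const u8 *src,
                                int nblocks)
{
        while (nblocks > 0) {
                int chunk = min(nblocks, YIELD_BLOCK_LIMIT);

                kernel_neon_begin();     /* NEON in use: no preemption */
                ghash_do_blocks(ctx, src, chunk);
                kernel_neon_end();       /* preemption point           */

                src     += chunk * GHASH_BLOCK_SIZE;
                nblocks -= chunk;
        }
}
```

At 2.0 cpb, each 32-block chunk bounds the preempt-disabled region to about
1024 cycles, matching the ~1000 cycles quoted above; raising the chunk size
trades scheduling latency for fewer begin/end round trips.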
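The 3-loads/2-stores breakdown can be made concrete. With CONFIG_PREEMPT, a
per-CPU access has to be bracketed by preempt-count updates, so a single
logical load expands roughly as in the sketch below. demo_counter is a
made-up example variable, and the load/store annotations assume the arm64
implementation, where the per-CPU offset is read from the TPIDR_EL1 system
register rather than from memory.

```c
#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(unsigned long, demo_counter);  /* made-up example */

static unsigned long read_demo_counter(void)
{
        unsigned long v;

        preempt_disable();              /* load + store: bump preempt count  */
        v = raw_cpu_read(demo_counter); /* one load: offset comes from TPIDR */
        preempt_enable();               /* load + store: drop preempt count,
                                         * plus a need-resched check         */
        return v;
}
```

Helpers such as get_cpu_var()/put_cpu_var() expand to the same bracketing,
which is why every per-CPU access pays this cost once CONFIG_PREEMPT is
enabled.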