On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote:
> On 25 July 2018 at 11:09, Dave Martin <Dave.Martin@xxxxxxx> wrote:
> > On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
> >> Vakul reports a considerable performance hit when running the accelerated
> >> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> >> been updated to take the TIF_NEED_RESCHED flag into account.
> >>
> >> The issue appears to be caused by the fact that Cortex-A53, the core in
> >> question, has a high end implementation of the Crypto Extensions, and
> >> has a shallow pipeline, which means even sequential algorithms that may be
> >> held back by pipeline stalls on high end out of order cores run at maximum
> >> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
> >> speed in the order of 2 to 4 cycles per byte, and are currently implemented
> >> to check the TIF_NEED_RESCHED flag after each iteration, which may process
> >> as little as 16 bytes (for GHASH).
> >>
> >> Obviously, every cycle of overhead hurts in this context, and given that
> >> the A53's load/store unit is not quite high end, any delays caused by
> >> memory accesses that occur in the inner loop of the algorithms are going
> >> to be quite significant, hence the performance regression.
> >>
> >> So reduce the frequency at which the NEON yield checks are performed, so
> >> that they occur roughly once every 1000 cycles, which is hopefully a
> >> reasonable tradeoff between throughput and worst case scheduling latency.
> >
> > Is there any way to tune this automatically, or is that likely to be more
> > trouble than it's worth?
> >
>
> Good question. I think A53 is a reasonable worst case, and these
> changes reduce the impact to the same ballpark as the impact of
> enabling CONFIG_PREEMPT in the first place.
>
> > Also, how did you come up with 1000 cycles? At what point does
> > preemption latency become more/less important than throughput?
> >
>
> Another good question. I was hoping Sebastian or the other -rt folks
> would be triggered by this. Given the above, I ended up with a ~1000
> cycles quantum, and hopefully this is considered to be small enough.
>
> > Maybe someone already invented a similar framework somewhere else in the
> > kernel. I seem to remember some automatic selection of memcpy
> > implementation based on a boot-time benchmark, but I may have
> > misremembered.
> >
>
> We have crypto benchmarking code in the kernel, and at some point, I
> even did some work on selecting the best algo based on performance.
>
> But to be honest, I think this is a bit overkill. If you need those
> final 5% of throughput at any cost, you're better off running with
> CONFIG_PREEMPT=n anyway.

Can't really argue with any of that -- I was just wondering whether
there was precedent.

Hopefully the ~1000 cycles ballpark will satisfy most people. For the
rest, it's too bad: if somebody is relying on the last 1-2% of
performance, they probably have a broken use case.

Cheers
---Dave
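
To make the tradeoff concrete, the kind of rearrangement under
discussion looks roughly like the following when sketched in C. This
is a minimal sketch only, not the arm64 assembly the patch actually
touches: do_blocks(), process_block() and BLOCKS_PER_YIELD are
made-up names for illustration, and the real kernel code performs the
equivalent check in the assembly inner loops of each algorithm.

/*
 * Sketch: poll TIF_NEED_RESCHED once per batch of blocks instead of
 * after every block, so the cost of the check is amortised over
 * roughly a thousand cycles of useful work.
 */
#include <linux/kernel.h>	/* min()			*/
#include <linux/sched.h>	/* need_resched()		*/
#include <linux/types.h>	/* u8				*/
#include <asm/neon.h>		/* kernel_neon_begin()/_end()	*/

/* hypothetical per-block primitive, e.g. one 16-byte GHASH iteration */
const u8 *process_block(u8 *out, const u8 *in);

/* ~2 cycles/byte * 16 bytes/block * 32 blocks ~= 1000 cycles */
#define BLOCKS_PER_YIELD	32

static void do_blocks(u8 *out, const u8 *in, int blocks)
{
	kernel_neon_begin();
	while (blocks > 0) {
		int n = min(blocks, BLOCKS_PER_YIELD);

		blocks -= n;
		while (n--)
			in = process_block(out, in);

		if (blocks > 0 && need_resched()) {
			/*
			 * Drop and re-take the NEON context so that
			 * preemption is briefly re-enabled and the
			 * scheduler can run a higher-priority task.
			 */
			kernel_neon_end();
			kernel_neon_begin();
		}
	}
	kernel_neon_end();
}

The point of the batching is that the need_resched() test and the
potential kernel_neon_end()/kernel_neon_begin() round trip are paid
once per ~1000 cycles of work rather than once per 16-byte block.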