On 25 July 2018 at 11:45, Dave Martin <Dave.Martin@xxxxxxx> wrote:
> On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote:
>> On 25 July 2018 at 11:09, Dave Martin <Dave.Martin@xxxxxxx> wrote:
>> > On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
>> >> Vakul reports a considerable performance hit when running the accelerated
>> >> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
>> >> been updated to take the TIF_NEED_RESCHED flag into account.
>> >>
>> >> The issue appears to be caused by the fact that Cortex-A53, the core in
>> >> question, has a high end implementation of the Crypto Extensions, and
>> >> has a shallow pipeline, which means even sequential algorithms that may be
>> >> held back by pipeline stalls on high end out of order cores run at maximum
>> >> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
>> >> speed on the order of 2 to 4 cycles per byte, and are currently implemented
>> >> to check the TIF_NEED_RESCHED flag after each iteration, which may process
>> >> as little as 16 bytes (for GHASH).
>> >>
>> >> Obviously, every cycle of overhead hurts in this context, and given that
>> >> the A53's load/store unit is not quite high end, any delays caused by
>> >> memory accesses that occur in the inner loop of the algorithms are going
>> >> to be quite significant, hence the performance regression.
>> >>
>> >> So reduce the frequency at which the NEON yield checks are performed, so
>> >> that they occur roughly once every 1000 cycles, which is hopefully a
>> >> reasonable tradeoff between throughput and worst case scheduling latency.
>> >
>> > Is there any way to tune this automatically, or is that likely to be more
>> > trouble than it's worth?
>> >
>>
>> Good question. I think A53 is a reasonable worst case, and these
>> changes reduce the impact to the same ballpark as the impact of
>> enabling CONFIG_PREEMPT in the first place.
>>
>> > Also, how did you come up with 1000 cycles? At what point does
>> > preemption latency become more/less important than throughput?
>> >
>>
>> Another good question. I was hoping Sebastian or the other -rt folks
>> would be triggered by this. Given the above, I ended up with a ~1000
>> cycle quantum, and hopefully this is considered to be small enough.
>>
>> > Maybe someone already invented a similar framework somewhere else in the
>> > kernel. I seem to remember some automatic selection of a memcpy
>> > implementation based on a boot-time benchmark, but I may have
>> > misremembered.
>> >
>>
>> We have crypto benchmarking code in the kernel, and at some point, I
>> even did some work on selecting the best algo based on performance.
>>
>> But to be honest, I think this is a bit overkill. If you need those
>> final 5% of throughput at any cost, you're better off running with
>> CONFIG_PREEMPT=n anyway.
>
> Can't really argue with any of that -- I was just wondering whether
> there was precedent.
>
> Hopefully the ~1000 cycles ballpark will satisfy most people. For
> the rest, it's too bad: if somebody is relying on the last 1-2% of
> performance, they probably have a broken use case.
>

Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a
1000 cycle limit to the quantum of work performed with preemption
disabled is unreasonably low, we can increase the yield block counts
and approach the optimal numbers a bit closer. But with diminishing
returns.
Also, if the cost of enabling CONFIG_PREEMPT by itself is significantly
reduced, e.g., by moving the per-CPU offset into a GPR, we can always
revisit this, of course.
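
Just to illustrate the tradeoff (this is not what the patch does, which
tunes the block counts inside the assembly routines): the same knob can
be expressed at the C glue code level by bounding how much data is
processed per kernel_neon_begin()/kernel_neon_end() section so that each
section stays within the ~1000 cycle budget. At 2 to 4 cycles per byte
that works out to roughly 256 to 512 bytes per section. Rough sketch
only: process_blocks() is a made-up stand-in for the accelerated
transform, and the 256 byte chunk size is simply the budget divided by
the worst case rate, not a measured number.

#include <linux/kernel.h>
#include <linux/types.h>
#include <asm/neon.h>

/* ~1000 cycle budget / ~4 cycles per byte on Cortex-A53 */
#define YIELD_CHUNK_BYTES	256

/* hypothetical NEON transform, stands in for the real routine */
void process_blocks(const u8 *data, unsigned int len);

static void crypto_walk(const u8 *data, unsigned int len)
{
	while (len) {
		unsigned int chunk = min_t(unsigned int, len, YIELD_CHUNK_BYTES);

		kernel_neon_begin();
		process_blocks(data, chunk);
		kernel_neon_end();	/* preemption point under CONFIG_PREEMPT */

		data += chunk;
		len -= chunk;
	}
}

The yield block counts in the assembly versions are the same knob applied
inside the routines: increasing them buys back throughput, decreasing
them reduces worst case scheduling latency.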