> > Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> >
> > Perhaps we should try a different approach. How about just limiting
> > the size to 4K, and then depending on need_resched we break out of
> > the loop? Something like:
> >
> > 	if (!len)
> > 		return 0;
> >
> > 	kernel_fpu_begin();
> > 	for (;;) {
> > 		unsigned int chunk = min(len, 4096);
> >
> > 		sha1_base_do_update(desc, data, chunk, sha1_xform);
> >
> > 		len -= chunk;
> > 		data += chunk;
> >
> > 		if (!len)
> > 			break;
> >
> > 		if (need_resched()) {
> > 			kernel_fpu_end();
> > 			cond_resched();
> > 			kernel_fpu_begin();
> > 		}
> > 	}
> > 	kernel_fpu_end();
>
> I implemented that conditional approach in the sha algorithms.
>
> The results of a boot (using sha512 for module signatures, with
> crypto extra tests enabled, comparing to sha512 with a 20 KiB
> fixed limit) are:
>
> sha1   cond:   14479 calls;   784256 cycles doing begin/end; longest FPU context     35828 cycles
> sha256 cond:   26763 calls;  1273570 cycles doing begin/end; longest FPU context    118612 cycles
> sha512 cond:   26957 calls;  1680046 cycles doing begin/end; longest FPU context 169140982 cycles
> sha512 20KiB: 161011 calls; 16232280 cycles doing begin/end; longest FPU context   4049644 cycles
>
> NOTE: I didn't have a patch in place to isolate the counts for each
> variation (ssse3 vs. avx vs. avx2) and
>  - for sha512: sha512 vs. sha384
>  - for sha256: sha256 vs. sha224
> so the numbers include sha256 and sha512 running twice as many tests
> as sha1.
>
> This approach looks very good:
> - 16% of the number of begin/end calls
> - 10% of the CPU cycles spent making the calls
> - the FPU context is held for a long time (77 ms) but only while
>   it's not needed.
>
> That's much more efficient than releasing it every 30 us just in case.

How recently did you make this change? I implemented this conditional
approach for ecb_cbc_helpers.h, but saw no changes at all to performance
on serpent-avx2 and twofish-avx.
kernel_fpu_{begin,end} (after the first call to begin) don't do anything
more than enable/disable preemption and make a few writes to the mxcsr.
It's likely that the above approach has the tiniest bit less overhead,
and it will preempt on non-CONFIG_PREEMPT kernels, but nothing suggests
a performance uplift.

This brings us back to this question: should crypto routines be
preempted under PREEMPT_VOLUNTARY or not?

> I'll keep testing this to make sure RCU stalls stay away, and apply
> the approach to the other algorithms.

I missed the earlier discussions. Have you seen issues with RCU
stalls/latency spikes because of crypto routines? If so, what preemption
model were you running?

> In x86, need_resched() has to deal with a PER_CPU variable, so I'm
> not sure it's worth the hassle to figure out how to do that from
> assembly code.

Leave it in C. It'll be more maintainable that way.

Cheers,
Peter Lafreniere <peter@xxxxxxxx>