> > Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> >
> > Perhaps we should try a different approach. How about just limiting
> > the size to 4K, and then depending on need_resched we break out of
> > the loop? Something like:
> >
> > 	if (!len)
> > 		return 0;
> >
> > 	kernel_fpu_begin();
> > 	for (;;) {
> > 		unsigned int chunk = min(len, 4096);
> >
> > 		sha1_base_do_update(desc, data, chunk, sha1_xform);
> >
> > 		len -= chunk;
> > 		data += chunk;
> >
> > 		if (!len)
> > 			break;
> >
> > 		if (need_resched()) {
> > 			kernel_fpu_end();
> > 			cond_resched();
> > 			kernel_fpu_begin();
> > 		}
> > 	}
> > 	kernel_fpu_end();
>
> I implemented that conditional approach in the sha algorithms.
>
> The results of a boot (using sha512 for module signatures, with
> crypto extra tests enabled, comparing to sha512 with a 20 KiB
> fixed limit) are:
>
> sha1   cond:   14479 calls;   784256 cycles doing begin/end; longest FPU context     35828 cycles
> sha256 cond:   26763 calls;  1273570 cycles doing begin/end; longest FPU context    118612 cycles
> sha512 cond:   26957 calls;  1680046 cycles doing begin/end; longest FPU context 169140982 cycles
> sha512 20KiB: 161011 calls; 16232280 cycles doing begin/end; longest FPU context   4049644 cycles
>
> NOTE: I didn't have a patch in place to isolate the counts for each
> variation (ssse3 vs. avx vs. avx2) and
>  - for sha512: sha512 vs. sha384
>  - for sha256: sha256 vs. sha224
> so the numbers include sha256 and sha512 running twice as many tests
> as sha1.
>
> This approach looks very good:
> - 16% of the number of begin/end calls
> - 10% of the CPU cycles spent making the calls
> - the FPU context is held for a long time (77 ms) but only while
>   it's not needed.
>
> That's much more efficient than releasing it every 30 us just in case.

How recently did you make this change? I implemented this conditional
approach for ecb_cbc_helpers.h, but saw no changes at all to performance
on serpent-avx2 and twofish-avx.
kernel_fpu_{begin,end} (after the first call to begin) don't do anything
more than enable/disable preemption and make a few writes to the mxcsr.
It's likely that the above approach has the tiniest bit less overhead,
and it will preempt on non-CONFIG_PREEMPT kernels, but nothing suggests
a performance uplift.

This brings us back to this question: should crypto routines be
preempted under PREEMPT_VOLUNTARY or not?

> I'll keep testing this to make sure RCU stalls stay away, and apply
> the approach to the other algorithms.

I missed the earlier discussions. Have you seen issues with RCU
stalls/latency spikes because of crypto routines? If so, what preemption
model were you running?

> In x86, need_resched() has to deal with a PER_CPU variable, so I'm
> not sure it's worth the hassle to figure out how to do that from
> assembly code.

Leave it in C. It'll be more maintainable that way.

Cheers,
Peter Lafreniere <peter@xxxxxxxx>