On Wed, Mar 05, 2025 at 10:07:45AM +0100, Ingo Molnar wrote:
> 
> * Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> 
> > From: Eric Biggers <ebiggers@xxxxxxxxxx>
> > 
> > Currently kernel-mode FPU is not always usable in softirq context on
> > x86, since softirqs can nest inside a kernel-mode FPU section in task
> > context, and nested use of kernel-mode FPU is not supported.
> > 
> > Therefore, x86 SIMD-optimized code that can be called in softirq
> > context has to sometimes fall back to non-SIMD code. There are two
> > options for the fallback, both of which are pretty terrible:
> > 
> > (a) Use a scalar fallback. This can be 10-100x slower than vectorized
> >     code because it cannot use specialized instructions like AES, SHA,
> >     or carryless multiplication.
> > 
> > (b) Execute the request asynchronously using a kworker. In other
> >     words, use the "crypto SIMD helper" in crypto/simd.c.
> > 
> > Currently most of the x86 en/decryption code (skcipher and aead
> > algorithms) uses option (b), since this avoids the slow scalar
> > fallback and it is easier to wire up. But option (b) is still really
> > bad for its own reasons:
> > 
> > - Punting the request to a kworker is bad for performance too.
> > 
> > - It forces the algorithm to be marked as asynchronous
> >   (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
> >   users who request a synchronous algorithm. That's another huge
> >   performance problem, which is especially unfortunate for users who
> >   don't even do en/decryption in softirq context.
> > 
> > - It makes all en/decryption operations take a detour through
> >   crypto/simd.c. That involves additional checks and an additional
> >   indirect call, which slow down en/decryption for *everyone*.
> > 
> > Fortunately, the skcipher and aead APIs are only usable in task and
> > softirq context in the first place. Thus, if kernel-mode FPU were to
> > be reliably usable in softirq context, no fallback would be needed.
> > Indeed, other architectures such as arm, arm64, and riscv have
> > already done this.
> > 
> > Therefore, this patch updates x86 accordingly to reliably support
> > kernel-mode FPU in softirqs.
> > 
> > This is done by just disabling softirq processing in kernel-mode FPU
> > sections (when hardirqs are not already disabled), as that prevents
> > the nesting that was problematic.
> > 
> > This will delay some softirqs slightly, but only ones that would have
> > otherwise been nested inside a task context kernel-mode FPU section.
> > Any such softirqs would have taken the slow fallback path before if
> > they tried to do any en/decryption. Now these softirqs will just run
> > at the end of the task context kernel-mode FPU section (since
> > local_bh_enable() runs pending softirqs) and will no longer take the
> > slow fallback path.
> > 
> > Alternatives considered:
> > 
> > - Make kernel-mode FPU sections fully preemptible. This would require
> >   growing task_struct by another struct fpstate, which is more than 2K.
> 
> So that's something that will probably happen once the kernel is built
> using APX anyway?

The APX state is just 16 GPRs, for 128 bytes total. That's about 5% of the
size of the fpstate (assuming AVX512 is supported).

As Dave mentioned, for in-kernel use of APX it will probably make more
sense to treat the new GPRs like the existing ones, instead of using
XSTATE and integrating it with kernel-mode FPU. I.e., they will be
saved/restored using plain moves to/from a dedicated buffer.
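To make the "dedicated buffer" idea concrete, something along these lines
(purely illustrative; none of the names below exist anywhere, and the
actual save/restore would just be a handful of plain register moves in
asm):

/*
 * 16 extended GPRs (r16-r31), 8 bytes each: 128 bytes per task, versus
 * the >2K struct fpstate needed for the full XSTATE.
 */
struct apx_regs {
	u64 r[16];
};

/*
 * Saved/restored with ordinary moves, like the existing GPRs, rather
 * than via XSAVE/XRSTOR and kernel_fpu_begin()/end().
 */
void apx_save_regs(struct apx_regs *buf);
void apx_restore_regs(const struct apx_regs *buf);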
> 
> > - Make softirqs save/restore the kernel-mode FPU state to a per-CPU
> >   struct fpstate when nested use is detected. Somewhat interesting,
> >   but seems unnecessary when a simpler solution exists.
> 
> So:
> 
> >  void kernel_fpu_begin_mask(unsigned int kfpu_mask)
> >  {
> > -	preempt_disable();
> > +	if (!irqs_disabled())
> > +		fpregs_lock();
> > 
> > +	if (!irqs_disabled())
> > +		fpregs_unlock();
> 
> So why is the irqs_disabled() check needed here? (On x86 it can be a
> bit expensive at times, because the IRQ flag has to be loaded,
> including all flags, so basically it's a soft synchronization point of
> a sort.)
> 
> Ie. why cannot we simply do a local_bh_disable()/enable() pair (on
> !RT), ie. fpregs_lock()/fpregs_unlock()?
> 
> local_bh_disable() is very similar in cost to preempt_disable(), both
> are increasing the preempt_count.

It's to keep kernel_fpu_begin()/end() working when hardirqs are disabled,
since local_bh_disable()/enable() require that hardirqs be enabled. See
the changelog and
https://lore.kernel.org/r/20250228035924.GC5588@sol.localdomain/.

There are other directions we could go, but this seems to be the simplest
solution. If we forbid kernel_fpu_begin() with hardirqs disabled (as PS1
did), then a call to irqs_disabled() is still needed in irq_fpu_usable().
To avoid irqs_disabled() entirely, we'd need to avoid disabling softirqs,
which would mean supporting nested kernel-mode FPU in softirqs. I can send
out a patch that does that using a per-CPU buffer, if you'd like to see
that. I wasn't super happy with the extra edge cases and memory usage, but
we could go in that direction.

- Eric
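
P.S. To make the hardirqs-disabled case concrete, here is roughly the
shape of the begin/end pair with this patch applied (a simplified sketch;
the actual state saving in the middle is omitted):

void kernel_fpu_begin_mask(unsigned int kfpu_mask)
{
	/*
	 * fpregs_lock() is local_bh_disable() on !PREEMPT_RT, which
	 * requires hardirqs to be enabled. If hardirqs are already
	 * disabled, neither preemption nor softirq processing can happen,
	 * so there is nothing extra to disable.
	 */
	if (!irqs_disabled())
		fpregs_lock();

	WARN_ON_FPU(!irq_fpu_usable());

	/* ... save the old FPU state and load a clean one ... */
}

void kernel_fpu_end(void)
{
	/* ... */

	if (!irqs_disabled())
		fpregs_unlock();
}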