On Tue, May 03 2022 at 11:06, Peter Zijlstra wrote: > On Mon, May 02, 2022 at 05:58:40PM +0200, Thomas Gleixner wrote: >> Right, though currently it's guaranteed that softirq processing context >> can use the FPU. Quite some of the network crypto work runs in softirq >> context, so this might cause a regression. If so, then this needs to be >> an explicit commit on top which is easy to revert. Let me stare at it >> some more. > > Right, so with the: > > preempt_disable(); > this_cpu_write(fpu_in_use, true); > barrier(); > > sequence it is safe against both softirq and hardirq fpu usage. The only > concern is performance not correctness when dropping that > local_bh_disable() thing. > > So what Thomas proposes makes sense to me. Now I was looking at it the other way round too; i.e. to use local_bh_disable() for both fpregs_lock() and kernel_fpu_begin(). Using local_bh_disable() for both fpregs_lock() and kernel_fpu_begin() is not possible with the current constraints, because kernel_fpu_begin() can be called from hard interrupt context. But the only use case which utilizes FPU from hard interrupt context is the random generator via add_randomness_...(). I did a benchmark of these functions, which invoke blake2s_update() three times in a row, on a SKL-X and a ZEN3. The generic code and the FPU accelerated code are pretty much on par vs. execution time of the algorithm itself plus/minus noise. But in case that the interrupt hits a userspace task the FPU needs to be saved and if the interrupt does not result in rescheduling then the return to user space has to restore it. That's _expensive_ and the actual cost depends on the FPU state, but 200-300 cycles for save and 200-700 cycles for restore are due. Even if we ignore the save/restore part and assume that it averages out vs. schedule() having to save FPU state anyway, then there is another aspect to this: power consumption which affects also thermal budget and capacity. Though that made me more curious and I did the same comparison for crc32 which is heavily used by ext4. crc32c_pcl_intel_update() already contains a switch to software when the buffer length is less than 512 bytes. But even on larger buffers, typically ~4k, FPU is not necessarily a win. It's consistently slower by a factor of ~1.4x. And that's not due to xsave/rstor overhead because these computations run on a worker thread which does not do that dance at all. IOW, using the FPU blindly for this kind of computations is not necessarily a good plan. I have no idea how these things are analyzed and evaluated if at all. Maybe the crypto people can shed some light on this. Thanks, tglx