Re: [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs

On Fri, 21 Feb 2025 19:31:24 +0000
Eric Biggers <ebiggers@xxxxxxxxxx> wrote:

> Hi Xiao,
> 
> On Fri, Feb 21, 2025 at 03:38:27PM +0800, Xiao Liang wrote:
> > > Therefore, this patch updates x86 accordingly to reliably support
> > > kernel-mode FPU in softirqs (except when hardirqs are disabled).
> > >
> > > This is done by just disabling softirq processing in kernel-mode FPU
> > > sections, as that prevents the nesting that was problematic.
> > >
> > > This will delay some softirqs slightly, but only ones that would have
> > > otherwise been nested inside a task context kernel-mode FPU section.
> > > Any such softirqs would have taken the slow fallback path before if they
> > > tried to do any en/decryption.  Now these softirqs will just run at the
> > > end of the task context kernel-mode FPU section (since local_bh_enable()
> > > runs pending softirqs) and will no longer take the slow fallback path.  
> > 
> > I think this will delay all softirqs, including those that don't use
> > FPU. Will there be a performance impact?
> > (I guess you've noticed the patch I submitted last year. And this is
> > the main reason why it was implemented in the way you mentioned as
> > the second alternative.)  
> 
> Thanks for taking a look at this patch!  It's true that this patch makes all
> softirqs on the same CPU be delayed until the end of the current kernel-mode FPU
> section.  But I'm a bit skeptical that it actually matters enough on x86 to go
> with a more complex solution that would allow nested kernel-mode FPU.
> Kernel-mode FPU sections are generally short; the usual cases are en/decrypting
> disk sectors or network packets that are 4 KiB or less.
> 
....
> I think it's also important to note that when local_bh_enable() re-enables
> softirq processing (when called from kernel_fpu_end()), it also immediately
> runs any pending softirqs.  Thus there would be no additional delay; the CPU
> will *immediately* run any pending softirqs.
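
For anyone skimming the thread: as I understand it, the mechanism boils
down to something like the following (my own simplified sketch, not the
literal diff; the real patch routes this through the existing helpers):

#include <linux/bottom_half.h>	/* local_bh_disable()/local_bh_enable() */
#include <linux/bug.h>
#include <linux/irqflags.h>	/* irqs_disabled() */
#include <linux/percpu.h>

static DEFINE_PER_CPU(bool, in_kernel_fpu);

void kernel_fpu_begin(void)
{
	/*
	 * Disabling softirqs (which on !PREEMPT_RT also disables
	 * preemption) means no softirq can run on this CPU until
	 * kernel_fpu_end(), so kernel-mode FPU sections can no
	 * longer nest via softirq.
	 */
	if (!irqs_disabled())
		local_bh_disable();
	else
		preempt_disable();	/* hardirqs off: softirqs can't run anyway */

	WARN_ON_ONCE(this_cpu_read(in_kernel_fpu));
	this_cpu_write(in_kernel_fpu, true);
	/* ... save the owner's FPU state here if it is still live ... */
}

void kernel_fpu_end(void)
{
	this_cpu_write(in_kernel_fpu, false);

	/*
	 * On !PREEMPT_RT, local_bh_enable() immediately runs any
	 * softirqs that became pending during the section, and they
	 * now get usable kernel-mode FPU themselves.
	 */
	if (!irqs_disabled())
		local_bh_enable();
	else
		preempt_enable();
}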

I'd also have thought that anything time-critical shouldn't rely on softirqs.
The network stack runs a lot of code in softirq context, so a bit of time
with softirqs disabled isn't going to make any difference to real-world latency.

I do wonder though whether the network napi code should be running in softirq
context at all.
With the amount of data it is trivial to push through a single 'consumer'
ethernet interface, it can easily cause the scheduler to misbehave.
I'd guess that google (etc) use threaded napi, multiple rx queues and RFS to
get the network processing spread out and not contending with process code.
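
FWIW those are all runtime knobs these days, something along these lines
(illustrative only; device names and table sizes obviously vary):

# Threaded NAPI (since v5.12), per device:
echo 1 > /sys/class/net/eth0/threaded
# RFS: size the global flow table, then each rx queue's table:
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt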

> 
> As for supporting nested kernel-mode FPU if we wanted to go that way: yes, your
> patch from last year
> https://lore.kernel.org/lkml/20240403140138.393825-1-shaw.leon@xxxxxxxxx/
> ostensibly did that.  However, I found some bugs in it; e.g., it didn't take
> into account that struct fpu is variable-length.  So it didn't turn out as
> simple as that patch made it seem.  Just extending fpregs_{lock,unlock}() to
> kernel-mode FPU is a simpler solution with fewer edge cases, and it avoids
> increasing the memory usage of the kernel.  So I thought I'd propose that first.
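
For reference, fpregs_lock()/fpregs_unlock() are currently just this
(roughly, from arch/x86/include/asm/fpu/api.h):

static inline void fpregs_lock(void)
{
	/*
	 * On PREEMPT_RT, local_bh_disable() does not disable
	 * preemption, so plain preempt_disable() is used there
	 * instead.
	 */
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		local_bh_disable();
	else
		preempt_disable();
}

static inline void fpregs_unlock(void)
{
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		local_bh_enable();
	else
		preempt_enable();
}

So extending that to kernel_fpu_begin()/end() really is reusing an
existing discipline rather than inventing a new one.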

Since many kernel users don't want the traditional FPU, and just need to use
an instruction that requires an AVX register or two, is it possible for code
to specify a small save area for just two or four registers and then use only
those registers (treating them all as caller-saved)?
I know that won't work with anything that affects the FPU status register,
but if you want a single wide register for a PCIe read (to generate a big TLP)
it is more than enough.
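
Something like this, say (entirely made-up names, untested; just to make
the idea concrete; note the kernel is built with -mno-sse/-mno-avx, so the
compiler itself never allocates these registers):

#include <linux/types.h>

/*
 * Save the current contents of ymm0/ymm1, use them freely, then put
 * the old contents back, treating a couple of vector registers as
 * caller-saved instead of saving the whole XSAVE area.  MXCSR and
 * the x87 state are untouched, hence no use for anything that sets
 * or tests FP status flags.
 */
struct ymm_scratch {
	u8 regs[2][32] __aligned(32);	/* vmovdqa wants 32-byte alignment */
};

static inline void ymm01_save(struct ymm_scratch *s)
{
	asm volatile("vmovdqa %%ymm0, %0\n\t"
		     "vmovdqa %%ymm1, %1"
		     : "=m" (s->regs[0]), "=m" (s->regs[1]));
}

static inline void ymm01_restore(const struct ymm_scratch *s)
{
	asm volatile("vmovdqa %0, %%ymm0\n\t"
		     "vmovdqa %1, %%ymm1"
		     : : "m" (s->regs[0]), "m" (s->regs[1]));
}

The caller would have to keep preemption, and realistically interrupts,
disabled between save and restore, since anything that saves the live
register state in the middle (a context switch, or kernel_fpu_begin()
in an interrupt) would capture the scratch contents instead of the
task's real ones.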

I'm sure there are horrid pitfalls, especially if IPIs are still used for
deferred saving of FPU state.

	David