From: Ard Biesheuvel > Sent: 21 April 2020 08:02 > On Tue, 21 Apr 2020 at 06:15, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote: > > > > Hi David, > > > > On Mon, Apr 20, 2020 at 2:32 AM David Laight <David.Laight@xxxxxxxxxx> wrote: > > > Maybe kernel_fp_begin() should be passed the address of somewhere > > > the address of an fpu save area buffer can be written to. > > > Then the pre-emption code can allocate the buffer and save the > > > state into it. > > > > Interesting idea. It looks like `struct xregs_state` is only 576 > > bytes. That's not exactly small, but it's not insanely huge either, > > and maybe we could justifiably stick that on the stack, or even > > reserve part of the stack allocation for that that the function would > > know about, without needing to specify any address. > > > > > kernel_fpu_begin() ought also be passed a parameter saying which > > > fpu features are required, and return which are allocated. > > > On x86 this could be used to check for AVX512 (etc) which may be > > > available in an ISR unless it interrupted inside a kernel_fpu_begin() > > > section (etc). > > > It would also allow optimisations if only 1 or 2 fpu registers are > > > needed (eg for some of the crypto functions) rather than the whole > > > fpu register set. > > > > For AVX512 this probably makes sense, I suppose. But I'm not sure if > > there are too many bits of crypto code that only use a few registers. > > There are those accelerated memcpy routines in i915 though -- ever see > > drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go > > this way, I wonder if it'd make sense to totally overengineer it and > > write a gcc/as plugin to create the register mask for us. Or, maybe > > some checker inside of objtool could help here. > > > > Actually, though, the thing I've been wondering about is actually > > moving in the complete opposite direction: is there some > > efficient-enough way that we could allow FPU registers in all contexts > > always, without the need for kernel_fpu_begin/end? I was reversing > > ntoskrnl.exe and was kind of impressed (maybe not the right word?) by > > their judicious use of vectorisation everywhere. I assume a lot of > > that is being generated by their compiler, which of course gcc could > > do for us if we let it. Is that an interesting avenue to consider? Or > > are you pretty certain that it'd be a huge mistake, with an > > irreversible speed hit? > > > > When I added support for kernel mode SIMD to arm64 originally, there > was a kernel_neon_begin_partial() that took an int to specify how many > registers were being used, the reason being that NEON preserve/store > was fully eager at this point, and arm64 has 32 SIMD registers, many > of which weren't really used, e.g., in the basic implementation of AES > based on special instructions. > > With the introduction of lazy restore, and SVE handling for userspace, > we decided to remove this because it didn't really help anymore, and > made the code more difficult to manage. > > However, I think it would make sense to have something like this in > the general case. I.e., NEON registers 0-3 are always preserved when > an exception or interrupt (or syscall) is taken, and so they can be > used anywhere in the kernel. If you want the whole set, you will have > to use begin/end as before. This would already unlock a few > interesting case, like memcpy, xor, and sequences that can easily be > implemented with only a few registers such as instructio based AES. > > Unfortunately, the compiler needs to be taught about this to be > completely useful, which means lots of prototyping and benchmarking > upfront, as the contract will be set in stone once the compilers get > on board. You can always just use asm with explicit registers. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)