This is a refurbished series originally started by Rik van Riel. The
goal is to load the FPU registers on return to userland and not on every
context switch. With this optimisation we can:
- avoid loading the registers if the task stays in the kernel and does
  not return to userland
- make kernel_fpu_begin() cheaper: it only saves the registers on the
  first invocation. The second invocation does not need to save them
  again.

To access the FPU registers in the kernel we need to:
- disable preemption so that the scheduler cannot switch tasks. A task
  switch would set TIF_LOAD_FPU and the FPU registers would no longer be
  valid.
- disable BH because a softirq might use kernel_fpu_begin() and then set
  TIF_LOAD_FPU instead of loading the FPU registers on completion.

v1…v3:

v2 was never posted. I followed the idea of completely decoupling PKRU
from xstate. This didn't quite work and made a few things complicated.
One obvious required fixup is copy_fpstate_to_sigframe(), where the PKRU
state needs to be fiddled into xstate. This required another
xfeatures_mask so that the sanity checks were performed and
xstate_offsets would be computed. Additionally, ptrace also reads/sets
xstate in order to get/set the registers, and PKRU is one of them. So
this would need some fiddling, too.

In v3 I dropped the decoupling idea. I also learned that the wrpkru
instruction is not privileged, so caching the PKRU value in the kernel
does not work. Instead I keep PKRU in the xstate area and load it at
context switch time while the remaining registers are deferred (until
return to userland). The offset of PKRU within xstate is enumerated at
boot time so why not use it.

This seems to work with my in-kernel test case and a userland test case
which use xmm registers. The pkey feature was tested in
non-KVM-accelerated qemu and it seems to work, too.

Sebastian