On 5/12/21 12:41 AM, Peter Zijlstra wrote: > On Tue, May 11, 2021 at 01:05:02PM -0400, Jon Kohler wrote: >> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h >> index 8d33ad80704f..5bc4df3a4c27 100644 >> --- a/arch/x86/include/asm/fpu/internal.h >> +++ b/arch/x86/include/asm/fpu/internal.h >> @@ -583,7 +583,13 @@ static inline void switch_fpu_finish(struct fpu *new_fpu) >> if (pk) >> pkru_val = pk->pkru; >> } >> - __write_pkru(pkru_val); >> + >> + /* >> + * WRPKRU is relatively expensive compared to RDPKRU. >> + * Avoid WRPKRU when it would not change the value. >> + */ >> + if (pkru_val != rdpkru()) >> + wrpkru(pkru_val); > Just wondering; why aren't we having that in a per-cpu variable? The > usual per-cpu MSR shadow approach avoids issuing any 'special' ops > entirely. It could be a per-cpu variable. When I wrote this originally I figured that a rdpkru would be cheaper than a load from memory (even per-cpu memory). But, now that I think about it, assuming that 'prku_val' is in %rdi, doing: cmp %gs:0x1234, %rdi might end up being cheaper than clobbering a *pair* of GPRs with rdpkru: xor %ecx,%ecx rdpkru cmp %rax, %rdi I'm too lazy to go figure out what would be faster in practice, though. Does anyone care?