Kevin,

On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>> #NM in the guest is slow path, right? So why are you trying to optimize
>> for it?
>
> This is really good information. The current logic is obviously
> based on the assumption that #NM is frequently triggered.

More context.

When an application wants to use AMX, it invokes the prctl() which
grants permission. Even when permission is granted, the kernel FPU
state buffers still have the default size and XFD is armed.

When a thread of that process issues the first AMX (tile) instruction,
#NM is raised.

The #NM handler does:

    1) Read MSR_XFD_ERR. If 0, goto regular #NM

    2) Write MSR_XFD_ERR to 0

    3) Check whether the process has permission granted. If not,
       raise SIGILL and return.

    4) Allocate and install a larger FPU state buffer for the task.
       If allocation fails, raise SIGSEGV and return.

    5) Disarm XFD for that task

That means one thread takes at most one AMX/XFD related #NM during its
lifetime, which means two VMEXITs.

If there are other XFD controlled facilities in the future, then it will
be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which uses
them. Not the end of the world either.

Looking at the targeted application space it's pretty unlikely that
tasks which utilize AMX are going to be so short lived that the overhead
of these VMEXITs really matters.

This of course can be revisited when there is a sane use case, but
optimizing for it prematurely does not buy us anything other than
pointless complexity.

>> The straight forward solution to this is:
>>
>>    1) Trap #NM and MSR_XFD_ERR write
>
> and #NM vmexit handler should be called in kvm_x86_handle_exit_irqoff()
> before preemption is enabled, otherwise there is still a small window
> where MSR_XFD_ERR might be clobbered after preemption enable and
> before #NM handler is actually called.

Yes.

Thanks,

        tglx
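
For illustration only, the five #NM handler steps listed above can be
modelled in plain userspace C. This is a sketch under stated
assumptions, not the kernel implementation: struct task_model,
handle_nm(), MODEL_AMX_BIT and the nm_result values are all
hypothetical names, the MSRs are modelled as plain struct fields, and
realloc() stands in for the real FPU state buffer allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model, not kernel code. Bit position is illustrative. */
#define MODEL_AMX_BIT (1ULL << 18)

enum nm_result {
    NM_REGULAR,   /* not XFD related, take the regular #NM path */
    NM_SIGILL,    /* no permission granted */
    NM_SIGSEGV,   /* buffer allocation failed */
    NM_HANDLED    /* buffer enlarged, XFD disarmed */
};

struct task_model {
    uint64_t xfd_err;    /* models MSR_XFD_ERR */
    uint64_t xfd;        /* models MSR_XFD, set bit == armed */
    uint64_t permitted;  /* features granted via the permission request */
    size_t   buf_size;   /* current FPU state buffer size */
    void    *fpu_buf;    /* current FPU state buffer */
};

static enum nm_result handle_nm(struct task_model *t, size_t large_size)
{
    uint64_t err;
    void *buf;

    assert(t != NULL);

    /* 1) Read MSR_XFD_ERR. If 0, goto regular #NM */
    err = t->xfd_err;
    if (!err)
        return NM_REGULAR;

    /* 2) Write MSR_XFD_ERR to 0 */
    t->xfd_err = 0;

    /* 3) Check whether the process has permission granted */
    if ((t->permitted & err) != err)
        return NM_SIGILL;

    /* 4) Allocate and install a larger FPU state buffer */
    buf = realloc(t->fpu_buf, large_size);
    if (!buf)
        return NM_SIGSEGV;
    t->fpu_buf = buf;
    t->buf_size = large_size;

    /* 5) Disarm XFD for that task */
    t->xfd &= ~err;
    return NM_HANDLED;
}
```

In this model a second tile instruction no longer faults, because the
XFD bit is clear after the first successful pass, which matches the
"at most one #NM per thread" argument.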