RE: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Sun, 12 Dec 2021 01:50:21 +0000

> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Sent: Saturday, December 11, 2021 9:29 PM
> 
> Kevin,
> 
> On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> >> #NM in the guest is slow path, right? So why are you trying to optimize
> >> for it?
> >
> > This is really good information. The current logic is obviously
> > based on the assumption that #NM is frequently triggered.
> 
> More context.
> 
> When an application wants to use AMX, it invokes the prctl() which
> grants permission. Even after permission is granted, the kernel FPU
> state buffers are still the default size and XFD is armed.
> 
> When a thread of that process issues the first AMX (tile) instruction,
> then #NM is raised.
> 
> The #NM handler does:
> 
>     1) Read MSR_XFD_ERR. If 0, goto regular #NM
> 
>     2) Write MSR_XFD_ERR to 0
> 
>     3) Check whether the process has permission granted. If not,
>        raise SIGILL and return.
> 
>     4) Allocate and install a larger FPU state buffer for the task.
>        If allocation fails, raise SIGSEGV and return.
> 
>     5) Disarm XFD for that task
> 
> That means one thread takes at max. one AMX/XFD related #NM during its
> lifetime, which means two VMEXITs.
> 
> If there are other XFD controlled facilities in the future, then it will
> be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which uses
> them. Not the end of the world either.
> 
> Looking at the targeted application space it's pretty unlikely that
> tasks which utilize AMX are going to be so short lived that the overhead
> of these VMEXITs really matters.
> 
> This of course can be revisited when there is a sane use case, but
> optimizing for it prematurely does not buy us anything other than
> pointless complexity.

I get all of the above.
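
To check my reading of the five steps, here is a rough user-space model of
the flow. It is only a sketch, not the kernel code: struct model_task,
enum nm_result and model_handle_nm() are hypothetical names made up for
illustration, the MSR is modeled as a plain variable, and realloc() stands
in for whatever allocation the kernel really performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical outcome codes for this model; not kernel identifiers. */
enum nm_result { NM_REGULAR, NM_SIGILL, NM_SIGSEGV, NM_HANDLED };

/* Hypothetical per-task state; the real kernel structures differ. */
struct model_task {
	bool   amx_permitted;	/* permission was granted via prctl() */
	bool   xfd_armed;	/* XFD is still armed for this task */
	size_t fpstate_size;	/* size of the current FPU state buffer */
	void  *fpstate;		/* the FPU state buffer itself */
};

/*
 * Model of the XFD part of the #NM handler. *xfd_err stands in for
 * MSR_XFD_ERR; large_size is the size of the enlarged buffer.
 */
static enum nm_result model_handle_nm(struct model_task *t,
				      uint64_t *xfd_err, size_t large_size)
{
	void *buf;

	assert(t && xfd_err);

	/* 1) Read MSR_XFD_ERR. If 0, this is a regular #NM. */
	if (*xfd_err == 0)
		return NM_REGULAR;

	/* 2) Write MSR_XFD_ERR to 0. */
	*xfd_err = 0;

	/* 3) No permission: SIGILL. */
	if (!t->amx_permitted)
		return NM_SIGILL;

	/* 4) Install a larger buffer; allocation failure: SIGSEGV. */
	buf = realloc(t->fpstate, large_size);
	if (!buf)
		return NM_SIGSEGV;
	t->fpstate = buf;
	t->fpstate_size = large_size;

	/* 5) Disarm XFD for this task. */
	t->xfd_armed = false;
	return NM_HANDLED;
}
```

In this model, once a task has gone through the NM_HANDLED path its XFD is
disarmed, so the same feature cannot raise another XFD-related #NM for it,
which matches the "at most one per thread" conclusion above.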

I guess the original open question is also about the frequency of #NM
exceptions that are not caused by XFD. For a Linux guest this does not
look like a problem, since CR0.TS is no longer set when math emulation
is not required:

DEFINE_IDTENTRY(exc_device_not_available)
{
	...
	/* This should not happen. */
	if (WARN(cr0 & X86_CR0_TS, "CR0.TS was set")) {
		/* Try to fix it up and carry on. */
		write_cr0(cr0 & ~X86_CR0_TS);
	} else {
		/*
		 * Something terrible happened, and we're better off trying
		 * to kill the task than getting stuck in a never-ending
		 * loop of #NM faults.
		 */
		die("unexpected #NM exception", regs, 0);
	}
}

It may affect a guest that still uses CR0.TS for lazy FPU save. But
modern OSes have most likely all moved to the eager save approach, so
always trapping #NM should be fine.

Is this understanding correct?

Thanks
Kevin