Re: [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler

Sean Christopherson <seanjc@xxxxxxxxxx> · Tue, 5 Mar 2024 07:39:45 -0800

On Tue, Mar 05, 2024, Oliver Upton wrote:
> On Mon, Mar 04, 2024 at 02:49:07PM -0800, Sean Christopherson wrote:
> 
> [...]
> 
> > The presense of MTE stuff shouldn't affect the fundamental access information,
> 
>   "When FEAT_MTE is implemented, for a synchronous Data Abort on an
>   instruction that directly accesses Allocation Tags, ISV is 0."
> 
> If there is no instruction syndrome, there's insufficient fault context
> to determine if the guest was doing a read or a write.
> 
> > e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE
> > irrespective of whether or not MTE is in play.
> 
> When the MMU generates such an abort, it *is not* read, write, or execute.
> It is a NoTagAccess fault. There is no sane way to describe this in
> terms of RWX.

RWX=0, with KVM_MEMORY_EXIT_FLAG_MTE seems like a reasonable way to communicate
that, no?

> > > > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for
> > > > "caching" emulated MMIO accesses, and the resulting fault captures the
> > > > "reserved bits set" information in register state.  But that's purely an
> > > > (optional) imlementation detail of KVM that should never be exposed to
> > > > userspace.
> > > 
> > > MMIO accesses would show up elsewhere though, right?
> > 
> > Yes, but I don't see how that's relevant.  Maybe I'm just misunderstanding what
> > you're saying/asking.
> 
> If "reserved" EPT violations found their way to userspace via the
> "memory fault" exit structure then that'd likely be due to a KVM bug.

Heh, sadly no.  Userspace can convert a GFN to private at any time, and the
TDX and SNP specs allow for implicit converstion "requests" from the guest, i.e.
KVM isn't allowed to kill the guest if the guest accesses a GFN with the "wrong"
private/shared attribute.

This particular case would likely be hit only if there's a userspace and/or guest
bug, but whether the setup is broken or just unique isn't KVM's call to make.

> The only expected flows in the near term are this and CoCo crap.
> 
> > > Either way, I have no issues whatsoever if the direction for x86 is to
> > > provide abstracted fault information.
> > 
> > I don't understand how ARM can get away with NOT providing a layer of abstraction.
> > Copying fault state verbatim to userspace will bleed KVM implementation details
> > into userspace,
> 
> The memslot flag already bleeds KVM implementation detail into userspace
> to a degree. The event we're trying to let userspace handle is at the
> intersection of a specific hardware/software state.

Yes, but IMO there's a huge difference between userspace knowing that KVM uses gup()
and memslots to translate gfn=>hva=>pfn, or even knowing that KVM plays games with
reserved stage-2 PTE bits, and userspace knowing exactly how KVM configures stage-2
PTEs.

Another example would be dirty logging on Intel CPUs.  The *admin* can decide
whether to use a write-protection scheme or page-modification logging, but KVM's
ABI with userspace provides a layer of abstraction (dirty ring or bitmap) so that
the userspace VMM doesn't need to do X for write-protection and Y for PML.

> > Abstracting gory hardware details from userspace is one of the main roles of the
> > kernel.
> 
> Where it can be accomplished without a loss (or misrepresentation) of
> information, agreed. But KVM UAPI is so architecture-specific that it
> seems arbitrary to draw the line here.

I don't think it's arbitrary.  KVM's existing uAPI for mapping memory into the
guest is almost entirely arch-neutral, and I want to preserve that for related
functionality unless it's literally impossible to do so.

> > A concrete example of hardware throwing a wrench in things is AMD's upcoming
> > "encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable
> > CPUs for *any* VM that supports guest-controlled encrypted memory.  If KVM reported
> > the page fault error code directly to userspace, then running the same VM on
> > different hardware generations, e.g. after live migration, would generate different
> > error codes.
> >  
> > Are we talking past each other?  I'm genuinely confused by the pushback on
> > capturing RWX information.  Yes, the RWX info may be insufficient in some cases,
> > but its existence doesn't preclude KVM from providing more information as needed.
> 
> My pushback isn't exactly on RWX (even though I noted the MTE quirk
> above). What I'm poking at here is the general infrastructure for
> reflecting faults into userspace, which is aggressively becoming more
> relevant.

But the purpose of memory_fault isn't to reflect faults into userspace, it's to
alert userspace that KVM has encountered a memory fault that requires userspace
action to resolve.

That distinction matters because there are and will be MMU features that KVM
supports, and that can generate novel faults, but such faults won't be punted to
userspace unless KVM provides a way for userspace to explicitly control the MMU
feature.

And if KVM lets userspace control a feature, then KVM needs new uAPI to expose
the controls.  Which means that we should always have an opportunity to expand
memory_fault, e.g. with new flags, to support such features.