Re: [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler

Sean Christopherson <seanjc@xxxxxxxxxx> · Mon, 4 Mar 2024 14:49:07 -0800

On Mon, Mar 04, 2024, Oliver Upton wrote:
> On Mon, Mar 04, 2024 at 12:32:51PM -0800, Sean Christopherson wrote:
> > On Mon, Mar 04, 2024, Oliver Upton wrote:
> > > On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote:
> 
> [...]
> 
> > > Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with
> > > kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should*
> > > shake out as a write fault on the stage-1 descriptor.
> > > 
> > > With that said, an architecture-neutral UAPI may not be able to capture
> > > the nuance of a fault. This UAPI will become much more load-bearing in
> > > the future, and the loss of granularity could become an issue.
> > 
> > What is the possible fallout from loss of granularity/nuance?  E.g. if the worst
> > case scenario is that KVM may exit to userspace multiple times in order to resolve
> > the problem, IMO that's an acceptable cost for having "dumb", common uAPI.
> > 
> > The intent/contract of the exit to userspace isn't for userspace to be able to
> > completely understand what fault occurred, but rather for KVM to communicate what
> > action userspace needs to take in order for KVM to make forward progress.
> 
> For one, the stage-2 page tables can describe permissions beyond RWX.
> MTE tag allocation can be controlled at stage-2, which (confusingly)
> desribes if the guest can insert tags in an opaque, physical space not
> described by HPFAR.
> 
> There is a corresponding bit in ESR_EL2 that describes this at the time
> of a fault, and R/W/X flags aren't enough to convey the right corrective
> action.
>
> > > Marc had some ideas about forwarding the register state to userspace
> > > directly, which should be the right level of information for _any_ fault
> > > taken to userspace.
> > 
> > I don't know enough about ARM to weigh in on that side of things, but for x86
> > this definitely doesn't hold true.
> 
> We tend to directly model the CPU architecture wherever possible, as it
> is the only way to create something intelligible. That same rationale
> applies to a huge portion of KVM UAPI; it is architecture-dependent by
> design.

Heh, "by design" :-)

I'm not saying "no arch-specific code in memory_fault", all I'm saying is that
stuff that can be arch-neutral, should be arch-neutral.  And AFAIK, basic RWX
information is common across all architectures.

E.g. if KVM needs to communicate MTE information on top of basic RWX info, why
not add a flag to memory_fault.flags that communicates that MTE is enabled and
relevant info can be found in an "extended" data field?

The presense of MTE stuff shouldn't affect the fundamental access information,
e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE
irrespective of whether or not MTE is in play.

The one thing we may want to squeak in before 6.8 is released is a placeholder
in memory_fault, though I don't think that's strictly necessary since the union
as a whole is padded to 256 bytes.  I suppose userspace could allocate based on
sizeof(kvm_run.memory_fault), but that's a bit of a stretch.

> > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for
> > "caching" emulated MMIO accesses, and the resulting fault captures the
> > "reserved bits set" information in register state.  But that's purely an
> > (optional) imlementation detail of KVM that should never be exposed to
> > userspace.
> 
> MMIO accesses would show up elsewhere though, right?

Yes, but I don't see how that's relevant.  Maybe I'm just misunderstanding what
you're saying/asking.

> If these magic SPTEs were causing -EFAULT exits then something must've gone
> sideways.

More or less.  This scenario can happen if the guest re-accesses a GFN that
doesn't have a memslot, but in the interim userspace made the GFN private.  It's
likely a misbehaving userspace, but that really doesn't matter.  KVM's contract
is to report that KVM exited to userspace because the guest was trying to access
GFN X as shared, but the GFN is configured as private by userspace.

My point was that dumping fault/register information straight to userspace in this
scenario, without massaging/filtering that information, is not a sane option on
x86.

> Either way, I have no issues whatsoever if the direction for x86 is to
> provide abstracted fault information.

I don't understand how ARM can get away with NOT providing a layer of abstraction.
Copying fault state verbatim to userspace will bleed KVM implementation details
into userspace, and risks breakage of KVM's ABI due to changes in hardware.
Abstracting gory hardware details from userspace is one of the main roles of the
kernel.

A concrete example of hardware throwing a wrench in things is AMD's upcoming
"encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable
CPUs for *any* VM that supports guest-controlled encrypted memory.  If KVM reported
the page fault error code directly to userspace, then running the same VM on
different hardware generations, e.g. after live migration, would generate different
error codes.

Are we talking past each other?  I'm genuinely confused by the pushback on
capturing RWX information.  Yes, the RWX info may be insufficient in some cases,
but its existence doesn't preclude KVM from providing more information as needed.