Re: [PATCH RFC v9 04/51] KVM: x86: Determine shared/private faults using a configurable mask

On Thu, Jun 22, 2023 at 09:55:22AM +0000, Huang, Kai wrote:
> 
> > 
> > So if we were to straight-forwardly implement that based on how TDX
> > currently handles checking for the shared bit in GPA, paired with how
> > SEV-SNP handles checking for private bit in fault flags, it would look
> > something like:
> > 
> >   bool kvm_fault_is_private(kvm, gpa, err)
> >   {
> >     /* SEV-SNP handling */
> >     /* SEV-SNP handling */
> >     if (kvm->arch.mmu_private_fault_mask)
> >       return !!(err & kvm->arch.mmu_private_fault_mask);
> > 
> >     /* TDX handling: a set shared bit means the fault is *not* private */
> >     if (kvm->arch.gfn_shared_mask)
> >       return !(gpa & kvm->arch.gfn_shared_mask);
> 
> The logic of the two are identical.  I think they need to be converged.

I think they're just different enough that trying too hard to converge
them might obfuscate things. If the determination didn't come from 2
completely different fields (GPA vs. fault flags) maybe it could be
simplified a bit more, but having a well-defined open-coded handler that
gets called once to set fault->is_private at initial fault time, so that
that ugliness never needs to be looked at again by KVM MMU, seems like a
good way to keep things simple through the rest of the handling.

> 
> Either SEV-SNP should convert the error code private bit to the gfn_shared_mask,
> or TDX's shared bit should be converted to some private error bit.

struct kvm_page_fault seems to be the preferred way to pass additional
params/metadata around, and the .is_private field was introduced to track
this private/shared state as part of the UPM base series:

  https://lore.kernel.org/lkml/20221202061347.1070246-9-chao.p.peng@xxxxxxxxxxxxxxx/

So it seems like unnecessary complexity to track/encode that state in
additional places rather than just encapsulating it all in
fault->is_private (or some similar field). Synthesizing all this
platform-specific handling into a well-defined value conveyed by
something like fault->is_private, so that KVM MMU doesn't need to worry
as much about platform-specific stuff, seems like a good thing, and in
line with what the UPM base series was trying to do by adding the
fault->is_private field.

So all I'm really proposing is that whatever SNP and TDX end up doing
should center around setting that fault->is_private field and keeping
everything contained there. If there are better ways to handle *how*
that's done, I don't have any complaints there, but moving/adding bits
to the GPA/error_flags after fault time just seems unnecessary to me
when the fault->is_private field can serve that purpose just as well.

> 
> Perhaps converting SEV-SNP makes more sense because if I recall correctly SEV
> guest also has a C-bit, correct?

That's correct, but the C-bit doesn't show up in the GPA that gets
passed up to KVM during an #NPF, and instead gets encoded into the
fault's error_flags.

> 
> Or, ...
> 
> > 
> >     return false;
> >   }
> > 
> >   kvm_mmu_do_page_fault(vcpu, gpa, err, ...)
> >   {
> >     struct kvm_page_fault fault = {
> >       ...
> >       .is_private = kvm_fault_is_private(vcpu->kvm, gpa, err)
> 
> ... should we do something like:
> 
> 	.is_private = static_call(kvm_x86_fault_is_private)(vcpu->kvm, gpa, 
> 							    err);

We actually had exactly this in v7 of SNP hypervisor patches:

  https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@xxxxxxx/T/#m17841f5bfdfb8350d69d78c6831dd8f3a4748638

but Sean was hoping to avoid a callback, which is why we ended up using
a bitmask in this version, since that's basically all the callback would
need to do. It's unfortunate that we don't have a common scheme between
SNP/TDX, but maybe that's still possible. I just think that whatever
that ends up being, it should live and be contained inside whatever
helper ends up setting fault->is_private.

There's some other awkwardness with a callback approach. It sort of ties
into your question about selftests so I'll address it below...


> 
> ?
> 
> >     };
> > 
> >     ...
> >   }
> > 
> > And then arch.mmu_private_fault_mask and arch.gfn_shared_mask would be
> > set per-KVM-instance, just like they are now with current SNP and TDX
> > patchsets, since stuff like KVM self-test wouldn't be setting those
> > masks, so it makes sense to do it per-instance in that regard.
> > 
> > But that still gets a little awkward for the KVM self-test use-case where
> > .is_private should sort of be ignored in favor of whatever the xarray
> > reports via kvm_mem_is_private(). 
> > 
> 
> I must have missed something.  Why does KVM self-test have an impact on
> how KVM handles private faults? 

The self-tests I'm referring to here are the ones from Vishal that shipped with
v10 of Chao's UPM/fd-based private memory series, and also as part of Sean's
gmem tree:

  https://github.com/sean-jc/linux/commit/a0f5f8c911804f55935094ad3a277301704330a6

These exercise gmem/UPM handling without the need for any SNP/TDX
hardware support. They do so by "trusting" the shared/private state
that the VMM sets via KVM_SET_MEMORY_ATTRIBUTES. So if the VMM says a
page should be private, KVM MMU will treat it as private; we'd never
get a mismatch, and KVM_EXIT_MEMORY_FAULT will never be generated.

> 
> > In your Misc. series I believe you
> > handled this by introducing a PFERR_HASATTR_MASK bit so we can determine
> > whether existing value of fault->is_private should be
> > ignored/overwritten or not.
> > 
> > So maybe kvm_fault_is_private() needs to return an integer value
> > instead, like:
> > 
> >   enum kvm_fault_type {
> >     KVM_FAULT_VMM_DEFINED,
> >     KVM_FAULT_SHARED,
> >     KVM_FAULT_PRIVATE,
> >   };
> > 
> >   int kvm_fault_is_private(kvm, gpa, err)
> >   {
> >     /* SEV-SNP handling */
> >     if (kvm->arch.mmu_private_fault_mask)
> >       return (err & kvm->arch.mmu_private_fault_mask) ?
> >              KVM_FAULT_PRIVATE : KVM_FAULT_SHARED;
> > 
> >     /* TDX handling */
> >     if (kvm->arch.gfn_shared_mask)
> >       return (gpa & kvm->arch.gfn_shared_mask) ?
> >              KVM_FAULT_SHARED : KVM_FAULT_PRIVATE;
> > 
> >     return KVM_FAULT_VMM_DEFINED;
> >   }
> > 
> > And then down in __kvm_faultin_pfn() we do:
> > 
> >   if (fault->is_private == KVM_FAULT_VMM_DEFINED)
> >     fault->is_private = kvm_mem_is_private(vcpu->kvm, fault->gfn);
> >   else if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> >     return kvm_do_memory_fault_exit(vcpu, fault);
> > 
> >   if (fault->is_private)
> >     return kvm_faultin_pfn_private(vcpu, fault);
> 
> 
> What does KVM_FAULT_VMM_DEFINED mean, exactly?
> 
> Shouldn't the fault type come from _hardware_?

In the above self-test use-case, there is no reliance on hardware
support, and fault->is_private should always be treated as being
whatever was set by the VMM via KVM_SET_MEMORY_ATTRIBUTES. That's why I
proposed the KVM_FAULT_VMM_DEFINED value: to encode that case into
fault->is_private so KVM MMU can handle protected self-test VMs of this
sort.

In the future, these protected self-test VMs might become the basis of
a new protected VM type where some sort of guest-issued hypercall can
be used to set whether a fault should be treated as shared/private,
rather than relying on the VMM-defined value. There's some current
discussion about that here:

  https://lore.kernel.org/lkml/20230620190443.GU2244082@xxxxxxxxxxxxxxxxxxxxx/T/#me627bed3d9acf73ea882e8baa76dfcb27759c440

Going back to your callback question above, that makes things a little
awkward, since kvm_x86_ops is statically defined for both the
kvm_amd/kvm_intel modules, and either can run these self-test guests as
well as these proposed "non-CC VMs", which rely on enlightened guest
kernels instead of TDX/SNP hardware support for managing private/shared
access.

So you either need to duplicate the handling for how to determine
private/shared for these other VM types into the kvm_intel/kvm_amd
callbacks, or have some way for the callback to say "fall back to the
common handling for self-tests and non-CC VMs". The latter is what we
implemented in v8 of this series, but Isaku suggested it was a bit too
heavyweight and proposed dropping the fall-back logic in favor of
updating kvm_x86_ops at run-time once we know whether or not it's a
TDX/SNP guest:

  https://lkml.iu.edu/hypermail/linux/kernel/2303.2/03009.html

which could work, but it still doesn't address Sean's desire to avoid
callbacks completely, and still amounts to a somewhat convoluted way
to hide away the TDX/SNP-specific bit checks for shared/private. Rather
than hide them away in callbacks that are already frowned upon by the
maintainer, I think it makes sense to "open-code" all these checks in a
common handler like kvm_fault_is_private() so we can make some progress
toward a consensus, and then iterate on it from there rather than
refining what may already be a dead-end path.

-Mike



