On Sat, Jul 13, 2024 at 08:25:42PM +0000, Edgecombe, Rick P wrote:
> On Sat, 2024-07-13 at 12:10 +0200, Paolo Bonzini wrote:
> > > This patch breaks our rebased TDX development tree. First
> > > kvm_gmem_prepare_folio() is called during the KVM_PRE_FAULT_MEMORY
> > > operation, then next kvm_gmem_populate() is called during the
> > > KVM_TDX_INIT_MEM_REGION ioctl to actually populate the memory, which
> > > hits the new -EEXIST error path.
> > 
> > It's not a problem to only keep patches 1-8 for 6.11, and move the
> > rest to 6.12 (except for the bit that returns -EEXIST in sev.c).
> > 
> > Could you push a branch for me to take a look?
> 
> Sure, here it is.
> 
> KVM:
> https://github.com/rpedgeco/linux/tree/tdx_kvm_dev-2024-07-12-mark_uptodate_issue
> Matching QEMU:
> https://github.com/intel-staging/qemu-tdx/releases/tag/tdx-qemu-wip-2024.06.19-v9.0.0
> 
> It is not fully based on kvm-coco-queue because it has the latest v2 of
> the zapping quirk swapped in.
> 
> > I've never liked that you have to do the explicit prefault before the
> > VM setup is finished; it's a TDX-specific detail that is transpiring
> > into the API.
> 
> Well, it's not too late to change direction again. I remember you and
> Sean were not fully of one mind on the tradeoffs.
> 
> I guess this series is trying to help userspace not mess up the order of
> things for SEV, whereas TDX's design was to let userspace hold the pieces
> from the beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY
> and KVM_TDX_INIT_MEM_REGION calls, mysteriously returning errors in later
> IOCTLs if something was missed, etc.

If SNP userspace were to call KVM_PRE_FAULT_MEMORY before SNP_LAUNCH_UPDATE
(the rough equivalent of KVM_TDX_INIT_MEM_REGION), I think the same issue
would arise, and in that case the uptodate flag you prototyped wouldn't be
enough to address it: SNP_LAUNCH_UPDATE would end up failing, because the
gmem_prepare hook previously triggered by KVM_PRE_FAULT_MEMORY would have
already put the corresponding RMP entries into an unexpected state
(guest-owned/private).

So for SNP, KVM_PRE_FAULT_MEMORY/SNP_LAUNCH_UPDATE are mutually exclusive on
what GPA ranges they can prep before finalizing launch state. *After*
finalizing launch state, however, KVM_PRE_FAULT_MEMORY can be called for
whatever range it likes. If gmem_prepare/gmem_populate has already been
called for a GPA, the uptodate flag will be set and KVM only needs to deal
with the mapping.

So I wonder if it would be possible to enforce that KVM_PRE_FAULT_MEMORY
only be used after finalizing the VM in the CoCo case? I realize that is
awkward for TDX, where KVM_PRE_FAULT_MEMORY is required to create the sEPT
mapping before encrypting, but maybe it would be possible for TDX to just do
that implicitly within KVM_TDX_INIT_MEM_REGION? That would free up
KVM_PRE_FAULT_MEMORY to be called on any range post-finalization, and all
the edge cases prior to finalization could be avoided if we have some way to
enforce that finalization has already been done.
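
To be clear about what I mean by "enforce", something like the below is the
shape I have in mind. This is purely illustrative, not code from this series;
kvm->arch.pre_fault_allowed is a made-up flag standing in for whatever state
records that SNP_LAUNCH_FINISH / KVM_TDX_FINALIZE_VM has completed:

  /*
   * Illustrative sketch only: gate KVM_PRE_FAULT_MEMORY on the initial
   * guest image having been finalized for VMs with private memory.
   */
  static int kvm_pre_fault_check(struct kvm *kvm)
  {
          if (kvm_arch_has_private_mem(kvm) && !kvm->arch.pre_fault_allowed)
                  return -EOPNOTSUPP;

          return 0;
  }

The idea being that the KVM_PRE_FAULT_MEMORY handler would call something
like that up front, so none of the pre-finalization edge cases above can be
reached in the first place.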

One thing I'm not sure of is whether KVM_TDX_INIT_MEM_REGION for a 4K page
could lead to a 2M sEPT mapping that overlaps with a GPA range passed to
KVM_PRE_FAULT_MEMORY, which I think could result in unexpected 'left' return
values unless we make sure to map exactly the GPA ranges populated by
KVM_TDX_INIT_MEM_REGION and nothing more.

-Mike

> Still, I might lean towards staying the course just because we have gone
> down this path for a while and we don't currently have any fundamental
> issues. Probably we *really* need to get the next TDX MMU stuff posted so
> we can start to add a bit more certainty to that statement.
> 
> > > Given we are not actually populating during KVM_PRE_FAULT_MEMORY and
> > > try to avoid booting a TD until we've done so, maybe setting
> > > folio_mark_uptodate() in kvm_gmem_prepare_folio() is not appropriate
> > > in that case? But it may not be easy to separate.
> > 
> > It would be easy (just return a boolean value from
> > kvm_arch_gmem_prepare() to skip folio_mark_uptodate() before the VM is
> > ready, and implement it for TDX) but it's ugly. You're also clearing
> > the memory unnecessarily before overwriting it.
> 
> Hmm, right. Since kvm_gmem_populate() does folio_mark_uptodate() again
> despite testing for it earlier, we can skip folio_mark_uptodate() in
> kvm_gmem_prepare_folio() for TDX during the pre-finalization stage and it
> will get marked there.
> 
> I put a little POC of this suggestion at the end of the branch. Just
> revert it to reproduce the issue.
> 
> I think in the context of the work to launch a TD, extra clearing of
> pages is not too bad. I'm more bothered by how it highlights the general
> pitfalls of TDX's special clever behavior for KVM_PRE_FAULT_MEMORY before
> TD initialization.
> 
> If/when we want to skip it, I wonder if we could move the clearing into
> the gmem_prepare callbacks.
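
Just to check my understanding of the skip-uptodate idea above, I'm reading
it as roughly the following. This is a hand-waved sketch, not the POC from
the branch; the kvm_gmem_prepare_folio() parameters are abbreviated, and the
"r > 0 means prepared but not yet populated" convention is just one way of
expressing the boolean Paolo mentioned:

  static int kvm_gmem_prepare_folio(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                                    int max_order, struct folio *folio)
  {
          int r;

          r = kvm_arch_gmem_prepare(kvm, gfn, pfn, max_order);
          if (r < 0)
                  return r;

          /* (folio zeroing and the rest of the real function omitted) */

          /*
           * r > 0: prepared, but the contents aren't valid yet (TDX before
           * finalization). Leave the folio !uptodate so that
           * kvm_gmem_populate() initializes it and calls
           * folio_mark_uptodate() once the data has actually been written.
           */
          if (!r)
                  folio_mark_uptodate(folio);

          return 0;
  }

Something along those lines seems like it would also fit with moving the
clearing into the gmem_prepare callbacks, if we end up wanting that.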