On Sat, Jul 13, 2024 at 08:25:42PM +0000, Edgecombe, Rick P wrote:
> On Sat, 2024-07-13 at 12:10 +0200, Paolo Bonzini wrote:
> > > This patch breaks our rebased TDX development tree. First
> > > kvm_gmem_prepare_folio() is called during the KVM_PRE_FAULT_MEMORY
> > > operation, then next kvm_gmem_populate() is called during the
> > > KVM_TDX_INIT_MEM_REGION ioctl to actually populate the memory, which
> > > hits the new -EEXIST error path.
> > 
> > It's not a problem to only keep patches 1-8 for 6.11, and move the
> > rest to 6.12 (except for the bit that returns -EEXIST in sev.c).
> > 
> > Could you push a branch for me to take a look?
> 
> Sure, here it is.
> 
> KVM:
> https://github.com/rpedgeco/linux/tree/tdx_kvm_dev-2024-07-12-mark_uptodate_issue
> Matching QEMU:
> https://github.com/intel-staging/qemu-tdx/releases/tag/tdx-qemu-wip-2024.06.19-v9.0.0
> 
> It is not fully based on kvm-coco-queue because it has the latest v2 of
> the zapping quirk swapped in.
> 
> > I've never liked that you have to do the explicit prefault before the
> > VM setup is finished; it's a TDX-specific detail that is transpiring
> > into the API.
> 
> Well, it's not too late to change direction again. I remember you and
> Sean were not fully of one mind on the tradeoffs.
> 
> I guess this series is trying to help userspace not mess up the order of
> things for SEV, whereas TDX's design was to let userspace hold the pieces
> from the beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY
> and KVM_TDX_INIT_MEM_REGION calls, mysteriously returning errors in later
> IOCTLs if something was missed, etc.

If SNP userspace were to call KVM_PRE_FAULT_MEMORY before SNP_LAUNCH_UPDATE
(the rough equivalent of KVM_TDX_INIT_MEM_REGION), I think the same issue
would arise, and in that case the uptodate flag you prototyped wouldn't be
enough to address it: SNP_LAUNCH_UPDATE would end up failing, because the
gmem_prepare hook previously triggered by KVM_PRE_FAULT_MEMORY would have
already put the corresponding RMP entries into an unexpected state
(guest-owned/private).

So for SNP, KVM_PRE_FAULT_MEMORY/SNP_LAUNCH_UPDATE are mutually exclusive on
what GPA ranges they can prep before finalizing launch state. *After*
finalizing launch state, however, KVM_PRE_FAULT_MEMORY can be called for
whatever range it likes. If gmem_prepare/gmem_populate has already been
called for a GPA, the uptodate flag will be set and KVM only needs to deal
with the mapping.

So I wonder if it would be possible to enforce that KVM_PRE_FAULT_MEMORY
only be used after finalizing the VM in the CoCo case? I realize that is
awkward for TDX, where KVM_PRE_FAULT_MEMORY is required to create the sEPT
mapping before encrypting, but maybe it would be possible for TDX to just do
that implicitly within KVM_TDX_INIT_MEM_REGION? That would free up
KVM_PRE_FAULT_MEMORY to be called on any range post-finalization, and all
the edge cases prior to finalization could be avoided if we have some way to
enforce that finalization has already been done.
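
To be clear about what I mean by "enforce", something like the below is the
shape I have in mind. This is purely illustrative, not code from this series;
kvm->arch.pre_fault_allowed is a made-up flag standing in for whatever state
records that SNP_LAUNCH_FINISH / KVM_TDX_FINALIZE_VM has completed:

  /*
   * Illustrative sketch only: gate KVM_PRE_FAULT_MEMORY on the initial
   * guest image having been finalized for VMs with private memory.
   */
  static int kvm_pre_fault_check(struct kvm *kvm)
  {
          if (kvm_arch_has_private_mem(kvm) && !kvm->arch.pre_fault_allowed)
                  return -EOPNOTSUPP;

          return 0;
  }

The idea being that the KVM_PRE_FAULT_MEMORY handler would call something
like that up front, so none of the pre-finalization edge cases above can be
reached in the first place.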

One thing I'm not sure of is whether KVM_TDX_INIT_MEM_REGION for a 4K page
could lead to a 2M sEPT mapping that overlaps with a GPA range passed to
KVM_PRE_FAULT_MEMORY, which I think could result in unexpected 'left' return
values unless we make sure to map exactly the GPA ranges populated by
KVM_TDX_INIT_MEM_REGION and nothing more.

-Mike

> Still, I might lean towards staying the course just because we have gone
> down this path for a while and we don't currently have any fundamental
> issues. Probably we *really* need to get the next TDX MMU stuff posted so
> we can start to add a bit more certainty to that statement.
> 
> > > Given we are not actually populating during KVM_PRE_FAULT_MEMORY and
> > > try to avoid booting a TD until we've done so, maybe setting
> > > folio_mark_uptodate() in kvm_gmem_prepare_folio() is not appropriate
> > > in that case? But it may not be easy to separate.
> > 
> > It would be easy (just return a boolean value from
> > kvm_arch_gmem_prepare() to skip folio_mark_uptodate() before the VM is
> > ready, and implement it for TDX) but it's ugly. You're also clearing
> > the memory unnecessarily before overwriting it.
> 
> Hmm, right. Since kvm_gmem_populate() does folio_mark_uptodate() again
> despite testing for it earlier, we can skip folio_mark_uptodate() in
> kvm_gmem_prepare_folio() for TDX during the pre-finalization stage and it
> will get marked there.
> 
> I put a little POC of this suggestion at the end of the branch. Just
> revert it to reproduce the issue.
> 
> I think in the context of the work to launch a TD, extra clearing of
> pages is not too bad. I'm more bothered by how it highlights the general
> pitfalls of TDX's special clever behavior for KVM_PRE_FAULT_MEMORY before
> TD initialization.
> 
> If/when we want to skip it, I wonder if we could move the clearing into
> the gmem_prepare callbacks.
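
Just to check my understanding of the skip-uptodate idea above, I'm reading
it as roughly the following. This is a hand-waved sketch, not the POC from
the branch; the kvm_gmem_prepare_folio() parameters are abbreviated, and the
"r > 0 means prepared but not yet populated" convention is just one way of
expressing the boolean Paolo mentioned:

  static int kvm_gmem_prepare_folio(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                                    int max_order, struct folio *folio)
  {
          int r;

          r = kvm_arch_gmem_prepare(kvm, gfn, pfn, max_order);
          if (r < 0)
                  return r;

          /* (folio zeroing and the rest of the real function omitted) */

          /*
           * r > 0: prepared, but the contents aren't valid yet (TDX before
           * finalization). Leave the folio !uptodate so that
           * kvm_gmem_populate() initializes it and calls
           * folio_mark_uptodate() once the data has actually been written.
           */
          if (!r)
                  folio_mark_uptodate(folio);

          return 0;
  }

Something along those lines seems like it would also fit with moving the
clearing into the gmem_prepare callbacks, if we end up wanting that.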