On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote: > On Mon, May 22, 2023, Michael Roth wrote: > > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote: > > > On Thu, May 11, 2023, Michael Roth wrote: > > I put together a tree with some fixups that are needed for against the > > kvm_gmem_solo base tree, and a set of hooks to handle invalidations, > > preparing the initial private state as suggested above, and a > > platform-configurable mask that the x86 MMU code can use for determining > > whether a fault is for private vs. shared pages. > > > > KVM: x86: Determine shared/private faults using a configurable mask > > ^ for TDX we could trivially add an inverted analogue of the mask/logic > > KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault > > KVM: x86: Add platform hooks for private memory invalidations > > Hrm, I'd prefer to avoid adding another hook for this case, arch code already has > a "hook" in the form of kvm_unmap_gfn_range(). We'd probably just need a > kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory > being zapped is private. kvm_unmap_gfn_range() does however get called with kvm->mmu_lock held so it might be tricky to tie RMP updates into that path. > > That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked > if and only if there's an overlapping memslot. I'll chew on that a bit to see if > there's a way to cleanly handle that case without another hook. I think it's worth > mapping out exactly what we want unbind() to look like anyways, e.g. right now the > code subtly relies on private memslots being immutable. I thought the direction you sort of driving at was to completely decouple RMP updates for physical pages from the KVM MMU map/unmap paths since the life-cycles of those backing pages and associated RMP state are somewhat separate from the state of the GFNs and kvm->mem_attr_array. It seems to make sense when dealing with things like this unbind() case. There's also cases like userspaces that opt to not discard memory after conversions because they highly favor performance over memory usage. In those cases it would make sense to defer marking the pages as shared in the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via KVM MMU invalidation path after a conversion. > > > KVM: x86: Add platform hook for initializing private memory > > This should also be unnecessary. The call to kvm_gmem_get_pfn() is from arch > code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock, > e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault(). I think that approach would work fine. The way I was thinking of things is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the page before mapping it into the guest, so moving it out into an explicit call should work just as well. That would also drop the need for the __kvm_gmem_get_pfn() stuff I needed to add for the initial case where we need to access the PFN prior to making it private. > > The only reason to add another arch hook would be if we wanted to converted the > RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of > waiting until #NPF. But I think I would rather add a generic ioctl() to allow > userspace to effectively prefault guest memory, e.g. to setup the RMP before > running a vCPU. Such an ioctl() would potentially be useful in other scenarios, > e.g. on the dest during live migration to reduce jitter. Agreed, deferring the RMPUPDATE until it's actually needed would give us more flexibility on optimizing for things like lazy-acceptance. For less-common scenarios like preallocation it makes sense to make that an opt-in sort of thing for userspace to configure explicitly. > > > *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations > > There was another bug in this path. The math for handling a non-zero offsets into > the file was wrong. The code now looks like: > > xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) { > struct kvm_gfn_range gfn_range = { > .start = slot->base_gfn + start - slot->gmem.index, Sorry if I'm missing something here, but isn't there a risk that: start - slot->gmem.index would be less than zero? E.g. starting GFN was 0, but current slot is bound at some non-zero offset in the same gmem instance. I guess the warning below shouldn't caught that, but it seems like a real scenario. Since 'index' corresponds to the gmem offset of the current slot, is there any reason not to do something like this?: .start = slot->base_gfn + index - slot->gmem.index, But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting we case just simplify to this?: .start = slot->base_gfn, -Mike > .end = slot->base_gfn + min(end - slot->gmem.index, slot->npages), > .slot = slot, > .pte = __pte(0), > .may_block = true, > }; > > if (WARN_ON_ONCE(start < slot->gmem.index || > end > slot->gmem.index + slot->npages)) > continue; > > kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end); > > flush |= kvm_unmap_gfn_range(kvm, &gfn_range); > }