On Wed, Jan 10, 2024 at 07:45:36AM -0800, Sean Christopherson wrote: > On Sat, Dec 30, 2023, Michael Roth wrote: > > diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst > > index b1beb2fe8766..d4325b26724c 100644 > > --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst > > +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst > > @@ -485,6 +485,34 @@ Returns: 0 on success, -negative on error > > > > See the SEV-SNP specification for further detail on the launch input. > > > > +20. KVM_SNP_LAUNCH_UPDATE > > +------------------------- > > + > > +The KVM_SNP_LAUNCH_UPDATE is used for encrypting a memory region. It also > > +calculates a measurement of the memory contents. The measurement is a signature > > +of the memory contents that can be sent to the guest owner as an attestation > > +that the memory was encrypted correctly by the firmware. > > + > > +Parameters (in): struct kvm_snp_launch_update > > + > > +Returns: 0 on success, -negative on error > > + > > +:: > > + > > + struct kvm_sev_snp_launch_update { > > + __u64 start_gfn; /* Guest page number to start from. */ > > + __u64 uaddr; /* userspace address need to be encrypted */ > > Huh? Why is KVM taking a userspace address? IIUC, the address unconditionally > gets translated into a gfn, so why not pass a gfn? > > And speaking of gfns, AFAICT start_gfn is never used. I think having both the uaddr and start_gfn parameters makes sense. I think it's only awkward because how I'm using the memslot to translate the uaddr to a GFN in the current implementation, but: a) It's actually not a requirement that uaddr be associated with a memslot. It could just as easily be any random userspace address containing the payload that we want to copy into the actual gmem pages associated with start_gfn. I think TDX does something similar in that regard, and it makes sense to give VMMs the option of handling things that way. b) If we switch to just having start_gfn, and no uaddr, then things get awkward because then you really do need to have a memslot set up to get at the payload, and have some way of pre-populating the gmem pages prior to conversion, either the way the current code does it (via copying shared memory prior to conversion), or by having some way to populate the gmem pages directly, which is even more painful. > > Oof, reading more of the code, this *requires* an effective in-place copy-and-convert > of guest memory. Yes, I'm having some trouble locating the various threads, but initially there were some discussions about having a userspace API that allows for UPM/gmem pages to be pre-populated before they are in-place encrypted, but we'd all eventually decided that having KVM handle this internally was the simplest approach. So that's how it's done here, KVM_SNP_LAUNCH_UPDATE copies the pages into gmem, then passes those pages on to firmware for encryption. Then the VMM is expected to mark the GFN range as private via KVM_SET_MEMORY_ATTRIBUTES, since the VMM understands what constitutes initial private/encrypted payload. I should document that better in KVM_SNP_LAUNCH_UPDATE docs however. > > > + __u32 len; /* length of memory region */ > > Bytes? Pages? One field above operates on frame numbers, one apparently operates > on a byte-granularity address. If we implement things as mentioned above, it makes sense to decouple uaddr from any page alignment/size restrictions since it would always be copied into the target gmem page starting at byte 0. This sort of assumes that the gmem page will initially be zero'd however, which is the case currently, but there's a TODO in kvm_gmem_get_folio() about potentially off-loading that to firmware. I'm not sure it would ever be applicable for these pages though. Worst case, KVM_SNP_LAUNCH_UPDATE can pad with 0's. > > > + __u8 imi_page; /* 1 if memory is part of the IMI */ > > What's "the IMI"? Initial Measurement Image? Yes, though the SNP Firmware ABI also references it as "Incoming Migration Image", which I think is a little clearer about its purpose and so that's the terminology I've been using in the kernel. > What's "the IMI"? Initial Measurement Image? I assume this is essentially just > a flag that communicates whether or not the page should be measured? This is actually for loading a measured migration agent into the target system so that it can handle receiving the encrypted guest data from the source. There's still a good deal of planning around how live migration will be handled however so if you think it's premature to expose this via KVM I can remove the related fields. > > > + __u8 page_type; /* page type */ > > + __u8 vmpl3_perms; /* VMPL3 permission mask */ > > + __u8 vmpl2_perms; /* VMPL2 permission mask */ > > + __u8 vmpl1_perms; /* VMPL1 permission mask */ > > Why? KVM doesn't support VMPLs. It does actually get used by the SVSM. I can remove these but then we'd need some capability bit or something to know when they are available if they get re-introduced. But that may be needed anyway since KVM needs some additional changes to handle scheduling threads running at different VMPL levels. > > > +static int snp_page_reclaim(u64 pfn) > > +{ > > + struct sev_data_snp_page_reclaim data = {0}; > > + int err, rc; > > + > > + data.paddr = __sme_set(pfn << PAGE_SHIFT); > > + rc = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err); > > + if (rc) { > > + /* > > + * If the reclaim failed, then page is no longer safe > > + * to use. > > Uh, why can reclaim fail, and why does the kernel apparently not care about > leaking pages? AFAICT, nothing ever complains beyond a pr_debug. That sounds > bonkers to me, i.e. at the very minimum, why doesn't this warrant a WARN_ON_ONCE? PAGE_RECLAIM shouldn't happen in practice, so yes, it makes sense to warn about this when it does. snp_leak_pages() is probably the most consistent/user-friendly place to convey these failures so I'll add a pr_warn() there. > > > + */ > > + snp_leak_pages(pfn, 1); > > + } > > + > > + return rc; > > +} > > + > > +static int host_rmp_make_shared(u64 pfn, enum pg_level level, bool leak) > > +{ > > + int rc; > > + > > + rc = rmp_make_shared(pfn, level); > > + if (rc && leak) > > + snp_leak_pages(pfn, > > + page_level_size(level) >> PAGE_SHIFT); > > Completely unnecessary wrap. > > > + > > + return rc; > > +} > > + > > static void sev_unbind_asid(struct kvm *kvm, unsigned int handle) > > { > > struct sev_data_deactivate deactivate; > > @@ -1990,6 +2020,154 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp) > > return rc; > > } > > > > +static int snp_launch_update_gfn_handler(struct kvm *kvm, > > + struct kvm_gfn_range *range, > > + void *opaque) > > +{ > > + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; > > + struct kvm_memory_slot *memslot = range->slot; > > + struct sev_data_snp_launch_update data = {0}; > > + struct kvm_sev_snp_launch_update params; > > + struct kvm_sev_cmd *argp = opaque; > > + int *error = &argp->error; > > + int i, n = 0, ret = 0; > > + unsigned long npages; > > + kvm_pfn_t *pfns; > > + gfn_t gfn; > > + > > + if (!kvm_slot_can_be_private(memslot)) { > > + pr_err("SEV-SNP requires private memory support via guest_memfd.\n"); > > Yeah, no. Sprinkling pr_err() all over the place in user-triggerable error paths > is not acceptable. I get that it's often hard to extract "what went wrong" out > of the kernel, but adding pr_err() everywhere is not a viable solution. Makes sense, I'll drop this. > > > + return -EINVAL; > > + } > > + > > + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) { > > + pr_err("Failed to copy user parameters for SEV-SNP launch.\n"); > > + return -EFAULT; > > + } > > + > > + data.gctx_paddr = __psp_pa(sev->snp_context); > > + > > + npages = range->end - range->start; > > + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL_ACCOUNT); > > + if (!pfns) > > + return -ENOMEM; > > + > > + pr_debug("%s: GFN range 0x%llx-0x%llx, type %d\n", __func__, > > + range->start, range->end, params.page_type); > > + > > + for (gfn = range->start, i = 0; gfn < range->end; gfn++, i++) { > > + int order, level; > > + bool assigned; > > + void *kvaddr; > > + > > + ret = __kvm_gmem_get_pfn(kvm, memslot, gfn, &pfns[i], &order, false); > > + if (ret) > > + goto e_release; > > + > > + n++; > > + ret = snp_lookup_rmpentry((u64)pfns[i], &assigned, &level); > > + if (ret || assigned) { > > + pr_err("Failed to ensure GFN 0x%llx is in initial shared state, ret: %d, assigned: %d\n", > > + gfn, ret, assigned); > > + return -EFAULT; > > + } > > + > > + kvaddr = pfn_to_kaddr(pfns[i]); > > + if (!virt_addr_valid(kvaddr)) { > > I really, really don't like that this assume guest_memfd is backed by struct page. There are similar enforcements in the SEV/SEV-ES code. There was some initial discussion about relaxing this for SNP so we could support things like /dev/mem-mapped guest memory, but then guest_memfd came along and made that to be an unlikely use-case in the near-term given that it relies on alloc_pages() currently and explicitly guards against mmap()'ing pages in userspace. I think it makes to keep the current tightened restrictions in-place until such a use-case comes along, since otherwise we are relaxing a bunch of currently-useful sanity checks that span all throughout the code to support some nebulous potential use-case that might never come along. I think it makes more sense to cross that bridge when we get there. > > > + pr_err("Invalid HVA 0x%llx for GFN 0x%llx\n", (uint64_t)kvaddr, gfn); > > + ret = -EINVAL; > > + goto e_release; > > + } > > + > > + ret = kvm_read_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE); > > Good gravy. If I'm reading this correctly, KVM: > > 1. Translates an HVA into a GFN. > 2. Gets the PFN for that GFN from guest_memfd > 3. Verifies the PFN is not assigned to the guest > 4. Copies memory from the shared memslot page to the guest_memfd page > 5. Converts the page to private and asks the PSP to encrypt it > > (a) As above, why is #1 a thing? Yah, it's probably best to avoid this, as proposed above. > (b) Why are KVM's memory attributes never consulted? It doesn't really matter if the attributes are set before or after KVM_SNP_LAUNCH_UPDATE, only that by the time the guest actually launches they pages get set to private so they get faulted in from gmem. We could document our expectations and enforce them here if that's preferable however. Maybe requiring KVM_SET_MEMORY_ATTRIBUTES(private) in advance would make it easier to enforce that userspace does the right thing. I'll see how that looks if there are no objections. > (c) What prevents TOCTOU issues with respect to the RMP? Time-of-use will be when the guest faults the gmem page in with C-bit set. If it is not in the expected Guest-owned/pre-validated state that SEV_CMD_SNP_LAUNCH_UPDATE expected/set, then the guest will get an RMP fault or #VC exception for unvalidated page access. It will also fail attestation. Not sure if that covers the scenarios you had in mind. > (d) Why is *KVM* copying memory into guest_memfd? As mentioned above, there were various discussions of ways of allowing userspace to pwrite() to the guest_memfd in advance of "sealing"/"binding" it and then encrypting it in place. I think this was one of the related threads: https://lore.kernel.org/linux-mm/YkyKywkQYbr9U0CA@xxxxxxxxxx/ My read at the time was that the requirements between pKVM/TDX/SNP were all so unique that we'd spin forever trying to come up with a userspace ABI that worked for everyone. At the time you'd suggested for pKVM to handle their specific requirements internally in pKVM to avoid unecessary churn on TDX/SNP side, and I took the same approach with SNP in implementing it internally in SNP's KVM interfaces since it seemed unlikely there would be much common ground with how TDX handles it via KVM_TDX_INIT_MEM_REGION. > (e) What guarantees the direct map is valid for guest_memfd? Are you suggesting this may change in the near-term? If so, we can re-work the code to write to guest_memfd via a temporary mapping or something, but otherwise it seems awkward to account for that scenario in current code given that SNP specifically has hooks to remove/re-add directmap entries based on RMPUPDATEs to avoid host breakage, so we would necessarily need to implement changes if guest_memfd ever made any changes in this regard. And we had prior discussions about handling directmap invalidation in guest_memfd, but Kirill mentioned here[1] that special handling didn't actually seem to be a requirement of TDX private memory, and so it didn't seem likely that pushing that into gmem would be a welcome change. All that said, TDX does still seem to invalidate directmap entries as part of tdh_mem_page_add(), so maybe there is a requirement there and this is worth revisiting? If so though, it's worth mentioning that cpa_lock contention on directmap updates is actually a significant contributor to some scalability issues we've noticed with lots of guests/vCPUs doing lazy acceptance and needing to frequently invalidate directmap entries as part of rmpupdate() during gmem allocations, so we're considering just forcing a 4K directmap for SNP until directmap updates can scale better, so that might be another reason to not have guest_memfd in the business of managing directmap updates until there's some concrete use-case in sight, like being able to rebuild 2MB/1GB directmap entries in a scalable way during run-time. [1] https://lore.kernel.org/linux-mm/20221102212656.6giugw542jdxsvhh@xxxxxxx/ > (f) Why does KVM's uAPI *require* the source page to come from the same memslot? As mentioned above, I think it makes sense to do away with this requirement and just treat source page as any other user-provided blob. > > > + if (ret) { > > + pr_err("Guest read failed, ret: 0x%x\n", ret); > > + goto e_release; > > + } > > + > > + ret = rmp_make_private(pfns[i], gfn << PAGE_SHIFT, PG_LEVEL_4K, > > + sev_get_asid(kvm), true); > > + if (ret) { > > + ret = -EFAULT; > > + goto e_release; > > + } > > + > > + data.address = __sme_set(pfns[i] << PAGE_SHIFT); > > + data.page_size = PG_LEVEL_TO_RMP(PG_LEVEL_4K); > > + data.page_type = params.page_type; > > + data.vmpl3_perms = params.vmpl3_perms; > > + data.vmpl2_perms = params.vmpl2_perms; > > + data.vmpl1_perms = params.vmpl1_perms; > > + ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, > > + &data, error); > > + if (ret) { > > + pr_err("SEV-SNP launch update failed, ret: 0x%x, fw_error: 0x%x\n", > > + ret, *error); > > + snp_page_reclaim(pfns[i]); > > + > > + /* > > + * When invalid CPUID function entries are detected, the firmware > > + * corrects these entries for debugging purpose and leaves the > > + * page unencrypted so it can be provided users for debugging > > + * and error-reporting. > > + * > > + * Copy the corrected CPUID page back to shared memory so > > + * userpsace can retrieve this information. > > Why? IIUC, this is basically backdooring reads/writes into guest_memfd to avoid > having to add proper mmap() support. The CPUID page is private/encrypted, so it needs to be a gmem page. SNP firmware is doing the backdooring when it writes the unencrypted, expected contents into the page during failure. What's wrong with copying the contents back into the source page so userspace can be use of it? If we implement the above-mentioned changes then the source page is just a userspace buffer that isn't necessarily associated with a memslot, so it becomes even more straightforward. Would that be acceptable? I'm not sure what you're proposing with mmap(). > > > + */ > > + if (params.page_type == SNP_PAGE_TYPE_CPUID && > > + *error == SEV_RET_INVALID_PARAM) { > > + int ret; > > Ugh, do not shadow variables. Will fix. > > > + > > + host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true); > > + > > + ret = kvm_write_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE); > > + if (ret) > > + pr_err("Failed to write CPUID page back to userspace, ret: 0x%x\n", > > + ret); > > + } > > + > > + goto e_release; > > + } > > + } > > + > > +e_release: > > + /* Content of memory is updated, mark pages dirty */ > > + for (i = 0; i < n; i++) { > > + set_page_dirty(pfn_to_page(pfns[i])); > > + mark_page_accessed(pfn_to_page(pfns[i])); > > + > > + /* > > + * If its an error, then update RMP entry to change page ownership > > + * to the hypervisor. > > + */ > > + if (ret) > > + host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true); > > + > > + put_page(pfn_to_page(pfns[i])); > > + } > > + > > + kvfree(pfns); > > Saving PFNs from guest_memfd, which is fully owned by KVM, is so unnecessarily > complex. Add a guest_memfd API (or three) to do this safely, e.g. to lock the > pages, do (and track) the RMP conversion, etc... Is adding 3 gmem APIs and tracking RMP states inside gmem really less complex than what's going on here? The PFNs are only held on to for the duration of this single function so they can be cleanly rolled back, and we're using blessed interfaces like kvm_gmem_get_pfn() to get at them. There's some nuances here that I'm not sure will map to a re-usable gmem API that would benefit other users. For instance, we need to: 1) grab the gmem PFN 2) initialize it in some platform-specific way (copy from source buffer in this case) 3) switch it private in RMP table 4) execute SEV_CMD_SNP_LAUNCH_UPDATE firmware cmd If 2-4 can all be done with self-contained platform-specific callback, then I could add a gmem API like: gmem_initialize_gfn_range(start, end, func, opaque) where: func: does roughly what snp_launch_update_gfn_handler currently does for each PFN it is handed opaque: some data structure that would provide the source buffer to initialize the gmem pages from Is that along the lines of what you're suggesting? It wouldn't involve "tracking" RMP conversions, 'func' would be aware of that for each PFN it is handed, but it's simple enough that it is easily re-usable for other platforms without too much fuss. If you really want to build some deeper tracking of RMP table states into gmem internals, then I could really use your feedback on the gmem_prepare() hook I added in this RFC[2] and included as part of this series, because I ended up not implementing tracking for number of reason detailed under "Hooks for preparing gmem pages" in the RFC cover letter and would likely need to revisit that aspect first before building out this interface. [2] https://lore.kernel.org/kvm/20231016115028.996656-1-michael.roth@xxxxxxx/ Thanks, Mike