In a CoCo VM when a page transitions from private to shared, or vice versa, attributes in the PTE must be updated *and* the hypervisor must be notified of the change. Because there are two separate steps, there's a window where the settings are inconsistent. Normally the code that initiates the transition (via set_memory_decrypted() or set_memory_encrypted()) ensures that the memory is not being accessed during a transition, so the window of inconsistency is not a problem. However, the load_unaligned_zeropad() function can read arbitrary memory pages at arbitrary times, which could access a transitioning page during the window. In such a case, CoCo VM specific exceptions are taken (depending on the CoCo architecture in use). Current code in those exception handlers recovers and does "fixup" on the result returned by load_unaligned_zeropad(). Unfortunately, this exception handling and fixup code is tricky and somewhat fragile. At the moment, it is broken for both TDX and SEV-SNP. There's also a problem with the current code in paravisor scenarios: TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need to be forwarded from the paravisor to the Linux guest, but there are no architectural specs for how to do that. To avoid these complexities of the CoCo exception handlers, change the core transition code in __set_memory_enc_pgtable() to do the following: 1. Remove aliasing mappings 2. Remove the PRESENT bit from the PTEs of all transitioning pages 3. Flush the TLB globally 4. Flush the data cache if needed 5. Set/clear the encryption attribute as appropriate 6. Notify the hypervisor of the page status change 7. Add back the PRESENT bit With this approach, load_unaligned_zeropad() just takes its normal page-fault-based fixup path if it touches a page that is transitioning. As a result, load_unaligned_zeropad() and CoCo VM page transitioning are completely decoupled. CoCo VM page transitions can proceed without needing to handle architecture-specific exceptions and fix things up. This decoupling reduces the complexity due to separate TDX and SEV-SNP fixup paths, and gives more freedom to revise and introduce new capabilities in future versions of the TDX and SEV-SNP architectures. Paravisor scenarios work properly without needing to forward exceptions. This approach may make __set_memory_enc_pgtable() slightly slower because of touching the PTEs three times instead of just once. But the run time of this function is already dominated by the hypercall and the need to flush the TLB at least once and maybe twice. In any case, this function is only used for CoCo VM page transitions, and is already unsuitable for hot paths. The architecture specific callback function for notifying the hypervisor typically must translate guest kernel virtual addresses into guest physical addresses to pass to the hypervisor. Because the PTEs are invalid at the time of callback, the code for doing the translation needs updating. virt_to_phys() or equivalent continues to work for direct map addresses. But vmalloc addresses cannot use vmalloc_to_page() because that function requires the leaf PTE to be valid. Instead, slow_virt_to_phys() must be used. Both functions manually walk the page table hierarchy, so performance is the same. Signed-off-by: Michael Kelley <mikelley@xxxxxxxxxxxxx> --- I'm assuming the TDX handling of the data cache flushing is the same with this new approach, and that it doesn't need to be paired with a TLB flush as in the current code. If that's not a correct assumption, let me know. I've left the two hypervisor callbacks, before and after Step 5 above. If the PTEs are invalid, it seems like the order of Step 5 and Step 6 shouldn't matter, so perhaps one of the callback could be dropped. Let me know if there are reasons to do otherwise. It may well be possible to optimize the new implementation of __set_memory_enc_pgtable(). The existing set_memory_np() and set_memory_p() functions do all the right things in a very clear fashion, but perhaps not as optimally as having all three PTE manipulations directly in the same function. It doesn't appear that optimizing the performance really matters here, but I'm open to suggestions. I've tested this on TDX VMs and SEV-SNP + vTOM VMs. I can also test on SEV-SNP VMs without vTOM. But I'll probably need help covering SEV and SEV-ES VMs. This RFC patch does *not* remove code that would no longer be needed. If there's agreement to take this new approach, I'll add patches to remove the unneeded code. This patch is built against linux-next20230704. arch/x86/hyperv/ivm.c | 3 ++- arch/x86/kernel/sev.c | 2 +- arch/x86/mm/pat/set_memory.c | 32 ++++++++++++-------------------- 3 files changed, 15 insertions(+), 22 deletions(-) diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c index 28be6df..2859ec3 100644 --- a/arch/x86/hyperv/ivm.c +++ b/arch/x86/hyperv/ivm.c @@ -308,7 +308,8 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo return false; for (i = 0, pfn = 0; i < pagecount; i++) { - pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE); + pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer + + i * HV_HYP_PAGE_SIZE) >> HV_HYP_PAGE_SHIFT; pfn++; if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) { diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 1ee7bed..59db55e 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -784,7 +784,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long hdr->end_entry = i; if (is_vmalloc_addr((void *)vaddr)) { - pfn = vmalloc_to_pfn((void *)vaddr); + pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT; use_large_entry = false; } else { pfn = __pa(vaddr) >> PAGE_SHIFT; diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index bda9f12..8a194c7 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -2136,6 +2136,11 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc) if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr)) addr &= PAGE_MASK; + /* set_memory_np() removes aliasing mappings and flushes the TLB */ + ret = set_memory_np(addr, numpages); + if (ret) + return ret; + memset(&cpa, 0, sizeof(cpa)); cpa.vaddr = &addr; cpa.numpages = numpages; @@ -2143,36 +2148,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc) cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty); cpa.pgd = init_mm.pgd; - /* Must avoid aliasing mappings in the highmem code */ - kmap_flush_unused(); - vm_unmap_aliases(); - /* Flush the caches as needed before changing the encryption attribute. */ - if (x86_platform.guest.enc_tlb_flush_required(enc)) - cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required()); + if (x86_platform.guest.enc_cache_flush_required()) + cpa_flush(&cpa, 1); /* Notify hypervisor that we are about to set/clr encryption attribute. */ if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc)) return -EIO; ret = __change_page_attr_set_clr(&cpa, 1); - - /* - * After changing the encryption attribute, we need to flush TLBs again - * in case any speculative TLB caching occurred (but no need to flush - * caches again). We could just use cpa_flush_all(), but in case TLB - * flushing gets optimized in the cpa_flush() path use the same logic - * as above. - */ - cpa_flush(&cpa, 0); + if (ret) + return ret; /* Notify hypervisor that we have successfully set/clr encryption attribute. */ - if (!ret) { - if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc)) - ret = -EIO; - } + if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc)) + return -EIO; - return ret; + return set_memory_p(&addr, numpages); } static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc) -- 1.8.3.1