On 09/09/20 10:27, Ard Biesheuvel wrote: > (adding Laszlo and Brijesh) > > On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <bp@xxxxxxxxx> wrote: >> >> + Ard so that he can ack the efi bits. >> >> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote: >>> From: Tom Lendacky <thomas.lendacky@xxxxxxx> >>> >>> Calling down to EFI runtime services can result in the firmware >>> performing VMGEXIT calls. The firmware is likely to use the GHCB of >>> the OS (e.g., for setting EFI variables), I've had to stare at this for a while. Because, normally a VMGEXIT is supposed to occur like this: - guest does something privileged - resultant non-automatic exit (NAE) injects a #VC exception - exception handler figures out what that privileged thing was - exception handler submits request to hypervisor via GHCB contents plus VMGEXIT instruction Point being, the agent that "owns" the exception handler is supposed to pre-allocate or otherwise provide the GHCB too, for information passing. So... what is the particular NAE that occurs during the execution of UEFI runtime services (at OS runtime)? And assuming it occurs, I'm unsure if the exception handler (IDT) at that point is owned (temporarily?) by the firmware. - If the #VC handler comes from the firmware, then I don't know why it would use the OS's GHCB. - If the #VC handler comes from the OS, then I don't understand why the commit message says "firmware performing VMGEXIT", given that in this case it would be the OS's #VC handler executing VMGEXIT. So, I think the above commit message implies a VMGEXIT *without* a NAE / #VC context. (Because, I fail to interpret the commit message in a NAE / #VC context in any way; see above.) OK, so let's see where the firmware performs a VMGEXIT *outside* of an exception handler, *while* at OS runtime. There seems to be one, in file "OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlashDxe.c": > VOID > QemuFlashPtrWrite ( > IN volatile UINT8 *Ptr, > IN UINT8 Value > ) > { > if (MemEncryptSevEsIsEnabled ()) { > MSR_SEV_ES_GHCB_REGISTER Msr; > GHCB *Ghcb; > > Msr.GhcbPhysicalAddress = AsmReadMsr64 (MSR_SEV_ES_GHCB); > Ghcb = Msr.Ghcb; > > // > // Writing to flash is emulated by the hypervisor through the use of write > // protection. This won't work for an SEV-ES guest because the write won't > // be recognized as a true MMIO write, which would result in the required > // #VC exception. Instead, use the the VMGEXIT MMIO write support directly > // to perform the update. > // > VmgInit (Ghcb); > Ghcb->SharedBuffer[0] = Value; > Ghcb->SaveArea.SwScratch = (UINT64) (UINTN) Ghcb->SharedBuffer; > VmgExit (Ghcb, SVM_EXIT_MMIO_WRITE, (UINT64) (UINTN) Ptr, 1); > VmgDone (Ghcb); > } else { > *Ptr = Value; > } > } This function *does* run at OS runtime (as a part of non-volatile UEFI variable writes). And note that, wherever MSR_SEV_ES_GHCB points to at the moment, is used as GHCB. If the guest kernel allocates its own GHCB and writes the allocation address to MSR_SEV_ES_GHCB, then indeed the firmware will use the GHCB of the OS. I reviewed edk2 commit 437eb3f7a8db ("OvmfPkg/QemuFlashFvbServicesRuntimeDxe: Bypass flash detection with SEV-ES", 2020-08-17), but I admit I never thought of the guest OS changing MSR_SEV_ES_GHCB. I'm sorry about that. As long as this driver is running before OS runtime (i.e., during the DXE and BDS phases), MSR_SEV_ES_GHCB is supposed to carry the value we set in "OvmfPkg/PlatformPei/AmdSev.c": > STATIC > VOID > AmdSevEsInitialize ( > VOID > ) > { > VOID *GhcbBase; > PHYSICAL_ADDRESS GhcbBasePa; > UINTN GhcbPageCount, PageCount; > RETURN_STATUS PcdStatus, DecryptStatus; > IA32_DESCRIPTOR Gdtr; > VOID *Gdt; > > if (!MemEncryptSevEsIsEnabled ()) { > return; > } > > PcdStatus = PcdSetBoolS (PcdSevEsIsEnabled, TRUE); > ASSERT_RETURN_ERROR (PcdStatus); > > // > // Allocate GHCB and per-CPU variable pages. > // Since the pages must survive across the UEFI to OS transition > // make them reserved. > // > GhcbPageCount = mMaxCpuCount * 2; > GhcbBase = AllocateReservedPages (GhcbPageCount); > ASSERT (GhcbBase != NULL); > > GhcbBasePa = (PHYSICAL_ADDRESS)(UINTN) GhcbBase; > > // > // Each vCPU gets two consecutive pages, the first is the GHCB and the > // second is the per-CPU variable page. Loop through the allocation and > // only clear the encryption mask for the GHCB pages. > // > for (PageCount = 0; PageCount < GhcbPageCount; PageCount += 2) { > DecryptStatus = MemEncryptSevClearPageEncMask ( > 0, > GhcbBasePa + EFI_PAGES_TO_SIZE (PageCount), > 1, > TRUE > ); > ASSERT_RETURN_ERROR (DecryptStatus); > } > > ZeroMem (GhcbBase, EFI_PAGES_TO_SIZE (GhcbPageCount)); > > PcdStatus = PcdSet64S (PcdGhcbBase, GhcbBasePa); > ASSERT_RETURN_ERROR (PcdStatus); > PcdStatus = PcdSet64S (PcdGhcbSize, EFI_PAGES_TO_SIZE (GhcbPageCount)); > ASSERT_RETURN_ERROR (PcdStatus); > > DEBUG ((DEBUG_INFO, > "SEV-ES is enabled, %lu GHCB pages allocated starting at 0x%p\n", > (UINT64)GhcbPageCount, GhcbBase)); > > AsmWriteMsr64 (MSR_SEV_ES_GHCB, GhcbBasePa); So what is the *actual* problem at OS runtime: - Is it that MSR_SEV_ES_GHCB still points at this PEI-phase *reserved* memory allocation (and so when QemuFlashPtrWrite() tries to access it during OS runtime, it doesn't have a runtime mapping for it)? - Or is it that the OS actively changes MSR_SEV_ES_GHCB, pointing to a memory area that the OS owns -- and *that* area is what QemuFlashPtrWrite() cannot access at OS runtime? The first problem statement does *not* seem to apply, given -- again -- that the commit message says, "firmware is likely to use the GHCB of the OS". So I think the second problem statement must apply. (I think the "reserved allocation" above is "reserved" only because we want to keep the OS out of it around the ExitBootServices() transition.) Back to the email: On 09/09/20 10:27, Ard Biesheuvel wrote: > On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <bp@xxxxxxxxx> wrote: >> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote: >>> so each GHCB in the system needs to be identity >>> mapped in the EFI page tables, as unencrypted, to avoid page faults. Not sure I agree about this, but at least it seems to confirm my understanding -- apparently the idea is, for the OS, to satisfy QemuFlashPtrWrite() in the firmware, by putting the "expected" mapping -- for wherever MSR_SEV_ES_GHCB is going to point to -- in place. >>> >>> Signed-off-by: Tom Lendacky <thomas.lendacky@xxxxxxx> >>> [ jroedel@xxxxxxx: Moved GHCB mapping loop to sev-es.c ] >>> Signed-off-by: Joerg Roedel <jroedel@xxxxxxx> > > > This looks like it is papering over a more fundamental issue: any > memory region that the firmware itself needs to access during the > execution of runtime services needs to be described in the UEFI memory > map, with the appropriate annotations so that the OS knows it should > include these in the EFI runtime page tables. So why has this been > omitted in this case? So yeah, the issue seems to be that the QemuFlashFvbServicesRuntimeDxe driver does not *own* the GHCB that it attempts to use at OS runtime. It doesn't know where MSR_SEV_ES_GHCB is going to point. Is QemuFlashFvbServicesRuntimeDxe permitted to change MSR_SEV_ES_GHCB *temporarily* at OS runtime? Because, in that case: - QemuFlashFvbServicesRuntimeDxe should allocate a Runtime Services Data block for GHCB when it starts up (if SEV-ES is active), - QemuFlashFvbServicesRuntimeDxe should register a SetVirtualAddressMap handler, and use EfiConvertPointer() from UefiRuntimeLib to convert the "runtime GHCB" address to virtual address, in that handler, - QemuFlashPtrWrite() should call EfiAtRuntime() from UefiRuntimeLib, and if the latter returns TRUE, then (a) use the runtime-converted address for populating the GHCB, and (b) temporarily swap MSR_SEV_ES_GHCB with the address of the self-allocated GHCB. (The MSR needs a *physical* address, so QemuFlashFvbServicesRuntimeDxe would have to remember / retain the original (physical) allocation address too.) If QemuFlashFvbServicesRuntimeDxe is not permitted to change MSR_SEV_ES_GHCB even temporarily (at OS runtime), then I think the approach proposed in this (guest kernel) patch is valid. Let me skim the code below... > > > >>> --- >>> arch/x86/boot/compressed/sev-es.c | 1 + >>> arch/x86/include/asm/sev-es.h | 2 ++ >>> arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++ >>> arch/x86/platform/efi/efi_64.c | 10 ++++++++++ >>> 4 files changed, 43 insertions(+) >>> >>> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c >>> index 45702b866c33..0a9a248ca33d 100644 >>> --- a/arch/x86/boot/compressed/sev-es.c >>> +++ b/arch/x86/boot/compressed/sev-es.c >>> @@ -12,6 +12,7 @@ >>> */ >>> #include "misc.h" >>> >>> +#include <asm/pgtable_types.h> >>> #include <asm/sev-es.h> >>> #include <asm/trapnr.h> >>> #include <asm/trap_pf.h> >>> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h >>> index e919f09ae33c..cf1d957c7091 100644 >>> --- a/arch/x86/include/asm/sev-es.h >>> +++ b/arch/x86/include/asm/sev-es.h >>> @@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void) >>> if (static_branch_unlikely(&sev_es_enable_key)) >>> __sev_es_nmi_complete(); >>> } >>> +extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd); >>> #else >>> static inline void sev_es_ist_enter(struct pt_regs *regs) { } >>> static inline void sev_es_ist_exit(void) { } >>> static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; } >>> static inline void sev_es_nmi_complete(void) { } >>> +static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; } >>> #endif >>> >>> #endif >>> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c >>> index 9ab3a4dfecd8..4e2b7e4d9b87 100644 >>> --- a/arch/x86/kernel/sev-es.c >>> +++ b/arch/x86/kernel/sev-es.c >>> @@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) >>> return 0; >>> } >>> >>> +/* >>> + * This is needed by the OVMF UEFI firmware which will use whatever it finds in >>> + * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu >>> + * runtime GHCBs used by the kernel are also mapped in the EFI page-table. Yup, this pretty much confirms my suspicion that QemuFlashPtrWrite() is at the center of this. (BTW, I don't think that the runtime services data allocation, in QemuFlashFvbServicesRuntimeDxe, for OS runtime GHCB purposes, would have to be "per CPU". Refer to "Table 35. Rules for Reentry Into Runtime Services" in the UEFI spec -- if one processor is executing SetVariable(), then no other processor must enter SetVariable(). And so we'll have *at most* one VCPU in QemuFlashPtrWrite(), at any time.) >>> + */ >>> +int __init sev_es_efi_map_ghcbs(pgd_t *pgd) >>> +{ >>> + struct sev_es_runtime_data *data; >>> + unsigned long address, pflags; >>> + int cpu; >>> + u64 pfn; >>> + >>> + if (!sev_es_active()) >>> + return 0; >>> + >>> + pflags = _PAGE_NX | _PAGE_RW; >>> + >>> + for_each_possible_cpu(cpu) { >>> + data = per_cpu(runtime_data, cpu); >>> + >>> + address = __pa(&data->ghcb_page); >>> + pfn = address >> PAGE_SHIFT; >>> + >>> + if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags)) >>> + return 1; >>> + } >>> + >>> + return 0; >>> +} >>> + >>> static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt) >>> { >>> struct pt_regs *regs = ctxt->regs; >>> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c >>> index 6af4da1149ba..8f5759df7776 100644 >>> --- a/arch/x86/platform/efi/efi_64.c >>> +++ b/arch/x86/platform/efi/efi_64.c >>> @@ -47,6 +47,7 @@ >>> #include <asm/realmode.h> >>> #include <asm/time.h> >>> #include <asm/pgalloc.h> >>> +#include <asm/sev-es.h> >>> >>> /* >>> * We allocate runtime services regions top-down, starting from -4G, i.e. >>> @@ -229,6 +230,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) >>> return 1; >>> } >>> >>> + /* >>> + * When SEV-ES is active, the GHCB as set by the kernel will be used >>> + * by firmware. Create a 1:1 unencrypted mapping for each GHCB. >>> + */ >>> + if (sev_es_efi_map_ghcbs(pgd)) { >>> + pr_err("Failed to create 1:1 mapping for the GHCBs!\n"); >>> + return 1; >>> + } >>> + >>> /* >>> * When making calls to the firmware everything needs to be 1:1 >>> * mapped and addressable with 32-bit pointers. Map the kernel Good point! And it even makes me wonder if the QemuFlashFvbServicesRuntimeDxe approach, with the runtime services data type memory allocation, is feasible at all. Namely, a page's encryption status, under SEV, is controlled through the PTE. And for this particular UEFI runtime area, it would *not* suffice for the OS to just virt-map it. The OS would also have to *decrypt* the area (mark the PTE as "plaintext"). In other words, it would be an "unprecedented" PTE for the OS to set up: the PTE would not only map the GVA to GPA, but also mark the area as "plaintext". Otherwise -- if the OS covers *just* the virt-mapping --, QemuFlashFvbServicesRuntimeDxe would populate its own "runtime GHCB" area just fine, but the actual data hitting the host RAM would be encrypted. And so the hypervisor could not interpret the GHCB. *If* QemuFlashFvbServicesRuntimeDxe should not change the kernel-owned PTE at runtime, even temporarily, for marking the GHCB as "plaintext", then the problem is indeed only solvable in the guest kernel, in my opinion. There simply isn't an "architected annotation" for telling the kernel, "virt-map this runtime services data type memory range, *and* mark it as plaintext at the same time". This would be necessary, as both actions affect the exact same PTE, and the firmware is not really allowed to touch the PTE at runtime. But we don't have such a hint. To summarize: for QemuFlashFvbServicesRuntimeDxe to allocate UEFI Runtime Services Data type memory, for its own runtime GHCB, two permissions are necessary (together), at OS runtime: - QemuFlashFvbServicesRuntimeDxe must be allowed to swap MSR_SEV_ES_GHCB temporarily (before executing VMGEXIT), - QemuFlashFvbServicesRuntimeDxe must be allowed to change the OS-owned PTE temporarily (for remapping the GHCB as plaintext, before writing to it). Thanks Laszlo