On Wed, 9 Sep 2020 at 16:49, Tom Lendacky <thomas.lendacky@xxxxxxx> wrote: > > On 9/9/20 7:44 AM, Laszlo Ersek wrote: > > On 09/09/20 10:27, Ard Biesheuvel wrote: > >> (adding Laszlo and Brijesh) > >> > >> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <bp@xxxxxxxxx> wrote: > >>> > >>> + Ard so that he can ack the efi bits. > >>> > >>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote: > >>>> From: Tom Lendacky <thomas.lendacky@xxxxxxx> > >>>> > >>>> Calling down to EFI runtime services can result in the firmware > >>>> performing VMGEXIT calls. The firmware is likely to use the GHCB of > >>>> the OS (e.g., for setting EFI variables), > > > > I've had to stare at this for a while. > > > > Because, normally a VMGEXIT is supposed to occur like this: > > > > - guest does something privileged > > - resultant non-automatic exit (NAE) injects a #VC exception > > - exception handler figures out what that privileged thing was > > - exception handler submits request to hypervisor via GHCB contents plus > > VMGEXIT instruction > > > > Point being, the agent that "owns" the exception handler is supposed to > > pre-allocate or otherwise provide the GHCB too, for information passing. > > > > So... what is the particular NAE that occurs during the execution of > > UEFI runtime services (at OS runtime)? > > > > And assuming it occurs, I'm unsure if the exception handler (IDT) at > > that point is owned (temporarily?) by the firmware. > > > > - If the #VC handler comes from the firmware, then I don't know why it > > would use the OS's GHCB. > > > > - If the #VC handler comes from the OS, then I don't understand why the > > commit message says "firmware performing VMGEXIT", given that in this > > case it would be the OS's #VC handler executing VMGEXIT. > > > > So, I think the above commit message implies a VMGEXIT *without* a NAE / > > #VC context. (Because, I fail to interpret the commit message in a NAE / > > #VC context in any way; see above.) > > Correct. > > > > > OK, so let's see where the firmware performs a VMGEXIT *outside* of an > > exception handler, *while* at OS runtime. There seems to be one, in file > > "OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlashDxe.c": > > Again, correct. Basically this is what is invoked when setting UEFI variables. > > > > >> VOID > >> QemuFlashPtrWrite ( > >> IN volatile UINT8 *Ptr, > >> IN UINT8 Value > >> ) > >> { > >> if (MemEncryptSevEsIsEnabled ()) { > >> MSR_SEV_ES_GHCB_REGISTER Msr; > >> GHCB *Ghcb; > >> > >> Msr.GhcbPhysicalAddress = AsmReadMsr64 (MSR_SEV_ES_GHCB); > >> Ghcb = Msr.Ghcb; > >> > >> // > >> // Writing to flash is emulated by the hypervisor through the use of write > >> // protection. This won't work for an SEV-ES guest because the write won't > >> // be recognized as a true MMIO write, which would result in the required > >> // #VC exception. Instead, use the the VMGEXIT MMIO write support directly > >> // to perform the update. > >> // > >> VmgInit (Ghcb); > >> Ghcb->SharedBuffer[0] = Value; > >> Ghcb->SaveArea.SwScratch = (UINT64) (UINTN) Ghcb->SharedBuffer; > >> VmgExit (Ghcb, SVM_EXIT_MMIO_WRITE, (UINT64) (UINTN) Ptr, 1); > >> VmgDone (Ghcb); > >> } else { > >> *Ptr = Value; > >> } > >> } > > > > This function *does* run at OS runtime (as a part of non-volatile UEFI > > variable writes). > > > > And note that, wherever MSR_SEV_ES_GHCB points to at the moment, is used > > as GHCB. > > > > If the guest kernel allocates its own GHCB and writes the allocation > > address to MSR_SEV_ES_GHCB, then indeed the firmware will use the GHCB > > of the OS. > > > > I reviewed edk2 commit 437eb3f7a8db > > ("OvmfPkg/QemuFlashFvbServicesRuntimeDxe: Bypass flash detection with > > SEV-ES", 2020-08-17), but I admit I never thought of the guest OS > > changing MSR_SEV_ES_GHCB. I'm sorry about that. > > > > As long as this driver is running before OS runtime (i.e., during the > > DXE and BDS phases), MSR_SEV_ES_GHCB is supposed to carry the value we > > set in "OvmfPkg/PlatformPei/AmdSev.c": > > > >> STATIC > >> VOID > >> AmdSevEsInitialize ( > >> VOID > >> ) > >> { > >> VOID *GhcbBase; > >> PHYSICAL_ADDRESS GhcbBasePa; > >> UINTN GhcbPageCount, PageCount; > >> RETURN_STATUS PcdStatus, DecryptStatus; > >> IA32_DESCRIPTOR Gdtr; > >> VOID *Gdt; > >> > >> if (!MemEncryptSevEsIsEnabled ()) { > >> return; > >> } > >> > >> PcdStatus = PcdSetBoolS (PcdSevEsIsEnabled, TRUE); > >> ASSERT_RETURN_ERROR (PcdStatus); > >> > >> // > >> // Allocate GHCB and per-CPU variable pages. > >> // Since the pages must survive across the UEFI to OS transition > >> // make them reserved. > >> // > >> GhcbPageCount = mMaxCpuCount * 2; > >> GhcbBase = AllocateReservedPages (GhcbPageCount); > >> ASSERT (GhcbBase != NULL); > >> > >> GhcbBasePa = (PHYSICAL_ADDRESS)(UINTN) GhcbBase; > >> > >> // > >> // Each vCPU gets two consecutive pages, the first is the GHCB and the > >> // second is the per-CPU variable page. Loop through the allocation and > >> // only clear the encryption mask for the GHCB pages. > >> // > >> for (PageCount = 0; PageCount < GhcbPageCount; PageCount += 2) { > >> DecryptStatus = MemEncryptSevClearPageEncMask ( > >> 0, > >> GhcbBasePa + EFI_PAGES_TO_SIZE (PageCount), > >> 1, > >> TRUE > >> ); > >> ASSERT_RETURN_ERROR (DecryptStatus); > >> } > >> > >> ZeroMem (GhcbBase, EFI_PAGES_TO_SIZE (GhcbPageCount)); > >> > >> PcdStatus = PcdSet64S (PcdGhcbBase, GhcbBasePa); > >> ASSERT_RETURN_ERROR (PcdStatus); > >> PcdStatus = PcdSet64S (PcdGhcbSize, EFI_PAGES_TO_SIZE (GhcbPageCount)); > >> ASSERT_RETURN_ERROR (PcdStatus); > >> > >> DEBUG ((DEBUG_INFO, > >> "SEV-ES is enabled, %lu GHCB pages allocated starting at 0x%p\n", > >> (UINT64)GhcbPageCount, GhcbBase)); > >> > >> AsmWriteMsr64 (MSR_SEV_ES_GHCB, GhcbBasePa); > > > > So what is the *actual* problem at OS runtime: > > > > - Is it that MSR_SEV_ES_GHCB still points at this PEI-phase *reserved* > > memory allocation (and so when QemuFlashPtrWrite() tries to access it > > during OS runtime, it doesn't have a runtime mapping for it)? > > At this point the GHCB MSR points to the OS GHCB, which isn't mapped by > the page tables supplied by the OS and used by UEFI. > > > > > - Or is it that the OS actively changes MSR_SEV_ES_GHCB, pointing to a > > memory area that the OS owns -- and *that* area is what > > QemuFlashPtrWrite() cannot access at OS runtime? > > Correct. > > > > > The first problem statement does *not* seem to apply, given -- again -- > > that the commit message says, "firmware is likely to use the GHCB of the > > OS". > > > > So I think the second problem statement must apply. > > > > (I think the "reserved allocation" above is "reserved" only because we > > want to keep the OS out of it around the ExitBootServices() transition.) > > > > Back to the email: > > > > On 09/09/20 10:27, Ard Biesheuvel wrote: > >> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <bp@xxxxxxxxx> wrote: > >>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote: > >>>> so each GHCB in the system needs to be identity > >>>> mapped in the EFI page tables, as unencrypted, to avoid page faults. > > > > Not sure I agree about this, but at least it seems to confirm my > > understanding -- apparently the idea is, for the OS, to satisfy > > QemuFlashPtrWrite() in the firmware, by putting the "expected" mapping > > -- for wherever MSR_SEV_ES_GHCB is going to point to -- in place. > > > >>>> > >>>> Signed-off-by: Tom Lendacky <thomas.lendacky@xxxxxxx> > >>>> [ jroedel@xxxxxxx: Moved GHCB mapping loop to sev-es.c ] > >>>> Signed-off-by: Joerg Roedel <jroedel@xxxxxxx> > >> > >> > >> This looks like it is papering over a more fundamental issue: any > >> memory region that the firmware itself needs to access during the > >> execution of runtime services needs to be described in the UEFI memory > >> map, with the appropriate annotations so that the OS knows it should > >> include these in the EFI runtime page tables. So why has this been > >> omitted in this case? > > > > So yeah, the issue seems to be that the QemuFlashFvbServicesRuntimeDxe > > driver does not *own* the GHCB that it attempts to use at OS runtime. It > > doesn't know where MSR_SEV_ES_GHCB is going to point. > > > > Is QemuFlashFvbServicesRuntimeDxe permitted to change MSR_SEV_ES_GHCB > > *temporarily* at OS runtime? > > > > Because, in that case: > > > > - QemuFlashFvbServicesRuntimeDxe should allocate a Runtime Services Data > > block for GHCB when it starts up (if SEV-ES is active), > > > > - QemuFlashFvbServicesRuntimeDxe should register a SetVirtualAddressMap > > handler, and use EfiConvertPointer() from UefiRuntimeLib to convert > > the "runtime GHCB" address to virtual address, in that handler, > > > > - QemuFlashPtrWrite() should call EfiAtRuntime() from UefiRuntimeLib, > > and if the latter returns TRUE, then (a) use the runtime-converted > > address for populating the GHCB, and (b) temporarily swap > > MSR_SEV_ES_GHCB with the address of the self-allocated GHCB. (The MSR > > needs a *physical* address, so QemuFlashFvbServicesRuntimeDxe would > > have to remember / retain the original (physical) allocation address > > too.) > > > > If QemuFlashFvbServicesRuntimeDxe is not permitted to change > > MSR_SEV_ES_GHCB even temporarily (at OS runtime), then I think the > > approach proposed in this (guest kernel) patch is valid. > > > > Let me skim the code below... > > > >> > >> > >> > >>>> --- > >>>> arch/x86/boot/compressed/sev-es.c | 1 + > >>>> arch/x86/include/asm/sev-es.h | 2 ++ > >>>> arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++ > >>>> arch/x86/platform/efi/efi_64.c | 10 ++++++++++ > >>>> 4 files changed, 43 insertions(+) > >>>> > >>>> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c > >>>> index 45702b866c33..0a9a248ca33d 100644 > >>>> --- a/arch/x86/boot/compressed/sev-es.c > >>>> +++ b/arch/x86/boot/compressed/sev-es.c > >>>> @@ -12,6 +12,7 @@ > >>>> */ > >>>> #include "misc.h" > >>>> > >>>> +#include <asm/pgtable_types.h> > >>>> #include <asm/sev-es.h> > >>>> #include <asm/trapnr.h> > >>>> #include <asm/trap_pf.h> > >>>> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h > >>>> index e919f09ae33c..cf1d957c7091 100644 > >>>> --- a/arch/x86/include/asm/sev-es.h > >>>> +++ b/arch/x86/include/asm/sev-es.h > >>>> @@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void) > >>>> if (static_branch_unlikely(&sev_es_enable_key)) > >>>> __sev_es_nmi_complete(); > >>>> } > >>>> +extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd); > >>>> #else > >>>> static inline void sev_es_ist_enter(struct pt_regs *regs) { } > >>>> static inline void sev_es_ist_exit(void) { } > >>>> static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; } > >>>> static inline void sev_es_nmi_complete(void) { } > >>>> +static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; } > >>>> #endif > >>>> > >>>> #endif > >>>> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c > >>>> index 9ab3a4dfecd8..4e2b7e4d9b87 100644 > >>>> --- a/arch/x86/kernel/sev-es.c > >>>> +++ b/arch/x86/kernel/sev-es.c > >>>> @@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) > >>>> return 0; > >>>> } > >>>> > >>>> +/* > >>>> + * This is needed by the OVMF UEFI firmware which will use whatever it finds in > >>>> + * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu > >>>> + * runtime GHCBs used by the kernel are also mapped in the EFI page-table. > > > > Yup, this pretty much confirms my suspicion that QemuFlashPtrWrite() is > > at the center of this. > > > > (BTW, I don't think that the runtime services data allocation, in > > QemuFlashFvbServicesRuntimeDxe, for OS runtime GHCB purposes, would have > > to be "per CPU". Refer to "Table 35. Rules for Reentry Into Runtime > > Services" in the UEFI spec -- if one processor is executing > > SetVariable(), then no other processor must enter SetVariable(). And so > > we'll have *at most* one VCPU in QemuFlashPtrWrite(), at any time.) > > > >>>> + */ > >>>> +int __init sev_es_efi_map_ghcbs(pgd_t *pgd) > >>>> +{ > >>>> + struct sev_es_runtime_data *data; > >>>> + unsigned long address, pflags; > >>>> + int cpu; > >>>> + u64 pfn; > >>>> + > >>>> + if (!sev_es_active()) > >>>> + return 0; > >>>> + > >>>> + pflags = _PAGE_NX | _PAGE_RW; > >>>> + > >>>> + for_each_possible_cpu(cpu) { > >>>> + data = per_cpu(runtime_data, cpu); > >>>> + > >>>> + address = __pa(&data->ghcb_page); > >>>> + pfn = address >> PAGE_SHIFT; > >>>> + > >>>> + if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags)) > >>>> + return 1; > >>>> + } > >>>> + > >>>> + return 0; > >>>> +} > >>>> + > >>>> static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt) > >>>> { > >>>> struct pt_regs *regs = ctxt->regs; > >>>> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c > >>>> index 6af4da1149ba..8f5759df7776 100644 > >>>> --- a/arch/x86/platform/efi/efi_64.c > >>>> +++ b/arch/x86/platform/efi/efi_64.c > >>>> @@ -47,6 +47,7 @@ > >>>> #include <asm/realmode.h> > >>>> #include <asm/time.h> > >>>> #include <asm/pgalloc.h> > >>>> +#include <asm/sev-es.h> > >>>> > >>>> /* > >>>> * We allocate runtime services regions top-down, starting from -4G, i.e. > >>>> @@ -229,6 +230,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) > >>>> return 1; > >>>> } > >>>> > >>>> + /* > >>>> + * When SEV-ES is active, the GHCB as set by the kernel will be used > >>>> + * by firmware. Create a 1:1 unencrypted mapping for each GHCB. > >>>> + */ > >>>> + if (sev_es_efi_map_ghcbs(pgd)) { > >>>> + pr_err("Failed to create 1:1 mapping for the GHCBs!\n"); > >>>> + return 1; > >>>> + } > >>>> + > >>>> /* > >>>> * When making calls to the firmware everything needs to be 1:1 > >>>> * mapped and addressable with 32-bit pointers. Map the kernel > > > > Good point! > > > > And it even makes me wonder if the QemuFlashFvbServicesRuntimeDxe > > approach, with the runtime services data type memory allocation, is > > feasible at all. Namely, a page's encryption status, under SEV, is > > controlled through the PTE. > > > > And for this particular UEFI runtime area, it would *not* suffice for > > the OS to just virt-map it. The OS would also have to *decrypt* the area > > (mark the PTE as "plaintext"). > > > > In other words, it would be an "unprecedented" PTE for the OS to set up: > > the PTE would not only map the GVA to GPA, but also mark the area as > > "plaintext". > > > > Otherwise -- if the OS covers *just* the virt-mapping --, > > QemuFlashFvbServicesRuntimeDxe would populate its own "runtime GHCB" > > area just fine, but the actual data hitting the host RAM would be > > encrypted. And so the hypervisor could not interpret the GHCB. > > > > *If* QemuFlashFvbServicesRuntimeDxe should not change the kernel-owned > > PTE at runtime, even temporarily, for marking the GHCB as "plaintext", > > then the problem is indeed only solvable in the guest kernel, in my > > opinion. > > > > There simply isn't an "architected annotation" for telling the kernel, > > "virt-map this runtime services data type memory range, *and* mark it as > > plaintext at the same time". > > > > This would be necessary, as both actions affect the exact same PTE, and > > the firmware is not really allowed to touch the PTE at runtime. But we > > don't have such a hint. > > > > > > To summarize: for QemuFlashFvbServicesRuntimeDxe to allocate UEFI > > Runtime Services Data type memory, for its own runtime GHCB, two > > permissions are necessary (together), at OS runtime: > > > > - QemuFlashFvbServicesRuntimeDxe must be allowed to swap MSR_SEV_ES_GHCB > > temporarily (before executing VMGEXIT), > > > > - QemuFlashFvbServicesRuntimeDxe must be allowed to change the OS-owned > > PTE temporarily (for remapping the GHCB as plaintext, before writing > > to it). > > > > Amazing summarization Laszlo! > Agreed. And thanks a lot for taking the time to do the analysis. Based on the above, Acked-by: Ard Biesheuvel <ardb@xxxxxxxxxx> Thanks,