Hi,

This RFC series is the KVM part of basic KVM SGX virtualization support
(KVM SGX EPC static partitioning + Launch Control + SGX2 support). Qemu
also needs to be changed to support KVM SGX virtualization, and the Qemu
part will be sent out separately in the future. You can also find this
series and the Qemu changes in the github repos below:

    https://github.com/01org/kvm-sgx.git
    https://github.com/01org/qemu-sgx.git

KVM SGX virtualization needs to work with the host SGX driver (explained
below), which has not been upstreamed yet, so part of this series depends
on the SGX driver. You can find the SGX driver in the github repo below:

    https://github.com/jsakkine-intel/linux-sgx

The SGX specification can be found in the latest Intel SDM as Volume 3D:

    https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf

The SGX specification is fairly complicated (it fills the entire Volume
3D), and it is unrealistic to list all hardware details here. Below are a
brief SGX overview (which I think is necessary background for discussing
the design) and the high level design.

Please help to review and give comments. Thanks!

============================ SGX Overview ===========================

- Enclave

Intel Software Guard Extensions (SGX) is a set of instructions and memory
access mechanisms that provide secure access for sensitive applications
and data. SGX allows an application to use a particular part of its
address space as an *enclave*, a protected area that provides
confidentiality and integrity even in the presence of privileged malware.
Accesses to the enclave memory area from any software not resident in the
enclave are prevented, including those from privileged software. The
diagram below illustrates an enclave within an application.

    |-----------------------|
    |                       |
    |   |---------------|   |
    |   |   OS kernel   |   |       |-----------------------|
    |   |---------------|   |       |                       |
    |   |               |   |       |   |---------------|   |
    |   |---------------|   |       |   | Entry table   |   |
    |   |    Enclave    |---|-----> |   |---------------|   |
    |   |---------------|   |       |   | Enclave stack |   |
    |   |   App code    |   |       |   |---------------|   |
    |   |---------------|   |       |   | Enclave heap  |   |
    |   |    Enclave    |   |       |   |---------------|   |
    |   |---------------|   |       |   | Enclave code  |   |
    |   |   App code    |   |       |   |---------------|   |
    |   |---------------|   |       |                       |
    |           |           |       |-----------------------|
    |-----------------------|

SGX supports SGX1 and SGX2 extensions. SGX1 provides basic enclave
support, and SGX2 allows additional flexibility in runtime management of
enclave resources and thread execution within an enclave.

- Enclave Page Cache

The Enclave Page Cache (EPC) is the physical memory that backs enclaves.
EPC is divided into 4K pages, and each EPC page is always aligned to a 4K
boundary. Hardware performs additional access control checks to restrict
access to EPC pages. The Enclave Page Cache Map (EPCM) is a secure
structure which holds one entry for each EPC page, and is used by
hardware to track the status of each EPC page (invisible to software).
Typically EPC and EPCM are reserved by BIOS as Processor Reserved Memory,
but the actual amount, size, and layout of EPC are model-specific and
dependent on BIOS settings. EPC is enumerated via the new SGX CPUID leaf,
and is reported as reserved memory.

EPC pages can be either invalid or valid. There are 4 valid EPC page
types in SGX1: regular EPC page, SGX Enclave Control Structure (SECS)
page, Thread Control Structure (TCS) page, and Version Array (VA) page.
SGX2 adds the Trimmed EPC page. Each enclave is associated with one SECS
page. Each thread in an enclave is associated with one TCS page. VA pages
are used in EPC page eviction and reload. A Trimmed EPC page is used when
a particular 4K page in an enclave is going to be freed (trimmed).

- ENCLS and ENCLU

Two new instructions, ENCLS and ENCLU, are introduced to manage enclaves
and EPC. ENCLS can only run in ring 0, while ENCLU can only run in ring
3. Both ENCLS and ENCLU have multiple leaf functions, with EAX indicating
the specific leaf function.

The specification of ENCLS and ENCLU can be found in SDM Chapter 41 SGX
Instruction References.

- Discovering SGX capability

CPUID.0x7.0:EBX.SGX[bit 2] reports the availability of SGX, and detailed
SGX info can be enumerated via the new CPUID leaf 0x12. CPUID.0x12.0x0
enumerates SGX capability (e.g. SGX1, SGX2), including enclave
instruction opcode support. CPUID.0x12.0x1 enumerates SGX capability of
processor state configuration and enclave configuration in the SECS
structure. CPUID.0x12.0x2 (and following indexes if they are valid)
enumerates EPC resources. Starting from CPUID.0x12.0x2, each index
reports one valid EPC section (base, size), until CPUID reports an
invalid EPC section. Typically multiple EPC sections only exist on
multi-socket server machines (which currently don't exist), and a client
machine or single socket server just reports one EPC section.

Please refer to Chapter 37.7.2 Intel SGX Resource Enumeration Leaves for
detailed info on the SGX CPUID leaves.

On a processor that supports SGX, SGX can also be opted in (or out) via
the SGX_ENABLE bit (bit 18) of the IA32_FEATURE_CONTROL MSR. If the
SGX_ENABLE bit is cleared while IA32_FEATURE_CONTROL is locked, then SGX
is disabled on the processor. The SGX CPUID leaf 0x12 is still available
if SGX is opted out via IA32_FEATURE_CONTROL. The SDM doesn't specify the
exact info that CPUID 0x12 will report in this case, but likely it will
report invalid SGX info. If SGX is opted in, then CPUID 0x12 reports
valid SGX info.

When SGX is unavailable or not opted in, ENCLS and ENCLU cause either #UD
or #GP, depending on the values of CPUID.0x7.0:EBX.SGX,
IA32_FEATURE_CONTROL.SGX_ENABLE and IA32_FEATURE_CONTROL.LOCK.

Please refer to Chapter 37.7.1 Intel SGX Opt-in Configuration for
detailed info.
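
To make the enumeration above concrete, here is a minimal userspace
sketch (not code from this series) that decodes one EPC section from the
raw register values of CPUID.0x12 sub-leaf 2 and above, and checks the
opt-in state from an IA32_FEATURE_CONTROL value. The helper names
decode_epc_section() and sgx_opted_in() and struct epc_section are made
up for illustration. The register layout used is my reading of the SDM
enumeration chapter referenced above, and the opt-in check assumes both
LOCK and SGX_ENABLE must be set, so please double check both against the
SDM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FEATURE_CONTROL_LOCK        (1ULL << 0)
#define FEATURE_CONTROL_SGX_ENABLE  (1ULL << 18)

/* Illustrative container for one enumerated EPC section. */
struct epc_section {
	uint64_t base;
	uint64_t size;
};

/* Assumption: SGX counts as opted in only if the MSR is locked with bit 18 set. */
static bool sgx_opted_in(uint64_t feature_control)
{
	return (feature_control & FEATURE_CONTROL_LOCK) &&
	       (feature_control & FEATURE_CONTROL_SGX_ENABLE);
}

/*
 * Decode CPUID.0x12.n (n >= 2) register values:
 *   EAX[3:0]   sub-leaf type, 1 means a valid EPC section
 *   EAX[31:12] base bits 31:12,  EBX[19:0] base bits 51:32
 *   ECX[31:12] size bits 31:12,  EDX[19:0] size bits 51:32
 * Returns false when the sub-leaf does not describe a valid section,
 * which is where enumeration stops.
 */
static bool decode_epc_section(uint32_t eax, uint32_t ebx,
			       uint32_t ecx, uint32_t edx,
			       struct epc_section *out)
{
	assert(out != NULL);

	if ((eax & 0xf) != 1)
		return false;

	out->base = ((uint64_t)(ebx & 0xfffff) << 32) | (eax & 0xfffff000u);
	out->size = ((uint64_t)(edx & 0xfffff) << 32) | (ecx & 0xfffff000u);
	return true;
}
```

A caller would feed it the four registers returned for sub-leaves 2, 3,
and so on, and stop at the first sub-leaf for which it returns false.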

- SGX Launch Control

On a processor that supports SGX, the IA32_SGXLEPUBKEYHASH[0-3] MSRs
contain the hash of an RSA public key. The Launch Enclave (LE) can only
run if it is signed with the corresponding RSA private key. Without SGX
Launch Control, hardware can only run an LE that is signed with Intel's
RSA key. SGX Launch Control allows software to change
IA32_SGXLEPUBKEYHASHn at runtime, allowing the processor to run a 3rd
party's LE.

SGX Launch Control adds a new SGX_LAUNCH_CONTROL_ENABLE bit (bit 17) to
the IA32_FEATURE_CONTROL MSR. If SGX_LAUNCH_CONTROL_ENABLE is 1,
IA32_SGXLEPUBKEYHASHn are writable at runtime after
IA32_FEATURE_CONTROL.LOCK is set. Otherwise they are read-only.

Typically BIOS allows the user to set up a 3rd party's
IA32_SGXLEPUBKEYHASHn before IA32_FEATURE_CONTROL is locked, and allows
the user to choose whether IA32_SGXLEPUBKEYHASHn can be changed at
runtime as well. However this depends on the BIOS implementation.

CPUID.0x7.0:ECX[bit 30] reports the availability of bit 17 of
IA32_FEATURE_CONTROL, meaning the processor only supports SGX Launch
Control when CPUID.0x7.0:ECX[bit 30] is 1.

- SGX interaction with VMX

A new 64-bit ENCLS-exiting bitmap control field is added to the VMCS
(encoding 0202EH) to control VMEXIT on ENCLS leaf functions, and a new
"Enable ENCLS exiting" control bit (bit 15) is defined in the secondary
processor-based VM-execution controls. The 1-setting of "Enable ENCLS
exiting" enables the ENCLS-exiting bitmap control. Support for the
1-setting of "Enable ENCLS exiting" is enumerated by
IA32_VMX_PROCBASED_CTLS2[bit 47]. IA32_VMX_PROCBASED_CTLS2[bit 47]
monitors CPUID.[EAX=0x7,ECX=0].EBX.SGX. A new ENCLS VM exit reason (60)
is also added to the Basic Exit Reasons.

The pseudocode below shows how the above execution control works:

    IF ( (in VMX non-root operation) and (Enable_ENCLS_EXITING = 1) )
    Then
        IF ( ((EAX < 63) and (ENCLS_EXITING_Bitmap[EAX] = 1)) or
             ((EAX > 62) and (ENCLS_EXITING_Bitmap[63] = 1)) )
        Then
            Set VMCS.EXIT_REASON = ENCLS;
            Deliver VM exit;
        FI;
    FI;

VM exits that originate within an enclave set the following two bits
before delivering the VM exit to the VMM:

    - Bit 27 in the Exit reason field of Basic VM-exit information.
    - Bit 4 in the Interruptibility State of Guest Non-Register State of
      VMCS.

Refer to 42.5 Interactions with VMX, 27.2.1 Basic VM-Exit Information,
and 27.3.4 Saving Non-Register State.

========================= High Level Design ==========================

- Qemu Changes

EPC is a limited resource. Typically the EPC and EPCM together are 32M,
64M, or 128M, configurable in BIOS. In order to use EPC more efficiently
between different KVM guests, we add Qemu parameters to allow the
administrator to specify the guest's EPC size when it is created. We also
add two parameters for SGX Launch Control.

Specifically, the SGX parameters below are added:

    # qemu-system-x86_64 -sgx epc=<size>,lehash='256-bit value string',lewr

The 'epc' parameter specifies the guest's EPC size. Any MB-aligned value
is supported. 'lehash' specifies the initial value of the guest's
IA32_SGXLEPUBKEYHASHn, and 'lewr' specifies whether the guest's
IA32_SGXLEPUBKEYHASHn are writable by the guest OS. 'epc' is mandatory,
and both 'lehash' and 'lewr' are optional. Normally with 'lewr'
specified, 'lehash' is not needed (the default value is Intel's hash), as
the guest OS is able to change IA32_SGXLEPUBKEYHASHn as it wishes.

With the 'epc' parameter, Qemu is responsible for notifying KVM of the
guest's EPC base and size. The EPC base address is calculated by Qemu
internally (according to chip type, memory size, etc). With 'lehash'
specified, Qemu sets the guest's IA32_SGXLEPUBKEYHASHn to the value
specified.
With 'lewr' specified, Qemu sets bit 17 of the guest's
IA32_FEATURE_CONTROL to 1.

- Expose SGX to guest

The SGX feature is exposed to the guest via SGX CPUID. Looking at SGX
CPUID, we can report the same CPUID info to the guest as on native for
most of the SGX CPUID leaves. With the same CPUID reported, the guest is
able to use the full capability of SGX, and KVM doesn't need to emulate
that info.

There are two exceptions. The first is that KVM obviously cannot report
the physical EPC to the guest, but should report the guest's (virtual)
EPC base and size (which KVM learns from Qemu as mentioned above). The
second is SECS.ATTRIBUTES, which is reported by CPUID.0x12.0x1:EAX-EDX.
In particular, it is SECS.ATTRIBUTES.XFRM (bits 127:64) that needs
emulation. It reports which XFRM bits can be set when creating an enclave
with ENCLS[ECREATE]. As the guest may not support all XFRM bits that are
supported by hardware, CPUID.0x12.0x1:ECX-EDX should report only the
guest's supported XFRM bits. All other CPUID info can be reported to the
guest the same as on native, and we only report one EPC section to the
guest (only CPUID.0x12.0x2 is valid).

- Initializing SGX for guest

As mentioned above, the guest's EPC base and size are determined by Qemu,
and KVM needs Qemu to pass this info before it can initialize SGX for the
guest. To avoid a new IOCTL for this purpose (e.g. KVM_SET_EPC), KVM
initializes the guest's SGX in KVM_SET_CPUID2, where Qemu passes the
guest's SGX CPUID, which includes the guest's EPC base and size. Also,
the SDM says SGX CPUID is actually thread-specific, so software cannot
assume all logical processors report the same SGX CPUID. Initializing the
guest's SGX in KVM_SET_CPUID2 gives KVM an opportunity to check whether
the SGX CPUID passed by Qemu is valid and consistent across all VCPUs.

- EPC management

On the host side there is the SGX driver, which serves host SGX
applications from userspace. It detects SGX features and manages all EPC
pages.
To work with the SGX driver simultaneously, we have to use a 'unified
model', in which the SGX driver still manages EPC and KVM calls the
driver's APIs to allocate/free EPC pages, etc. However KVM cannot call
the driver's APIs directly: on machines without the SGX feature the SGX
driver won't be loaded, and calling the driver's APIs directly would make
KVM unable to load either. Instead, KVM uses symbol_get to get the
driver's APIs at runtime, which avoids this issue.

For KVM guests, there are two approaches to managing EPC: static
partitioning and oversubscription. In static partitioning, all EPC pages
are allocated to the guest when it is created and are freed only when the
guest is destroyed. In oversubscription, EPC pages are allocated to the
guest on demand, and EPC pages allocated to a guest can be evicted by KVM
and reassigned to other guests. Accessing a guest EPC page that has no
physical EPC page mapped causes an EPT violation (or #PF in the case of
shadow paging), upon which a physical EPC page is allocated to the guest
(and reloaded into the enclave if required).

-- Static partitioning

Static partitioning is a simple approach. KVM only needs to allocate all
EPC pages when the guest is created and set up the mapping. All ENCLS
leaf functions will run perfectly in the guest, so KVM doesn't need to
turn on ENCLS VMEXIT. However KVM does need to turn on ENCLS VMEXIT if
KVM doesn't expose SGX to the guest, or if the guest has turned off SGX
via IA32_FEATURE_CONTROL.SGX_ENABLE. In such cases ENCLS run in the guest
may behave differently from native: on hardware SGX is indeed enabled,
but according to the SDM, running ENCLS in the guest while the guest's
SGX environment is abnormal should cause #UD or #GP. KVM needs to trap
ENCLS to emulate such behavior.

-- Oversubscription

While oversubscription is better in terms of functionality, it needs a
more complicated implementation. Below is a brief explanation of what
needs to be done in order to support EPC oversubscription between guests.

Below is the sequence to evict regular EPC pages:

    1) Select one or multiple regular EPC pages from one enclave
    2) Remove EPT/PT mapping for selected EPC pages
    3) Send IPIs to remote CPUs to flush TLB of selected EPC pages
    4) EBLOCK on selected EPC pages
    5) ETRACK on enclave's SECS page
    6) Allocate one available slot (8-byte) in VA page
    7) EWB on selected EPC pages

With EWB taking:

    - VA slot, to store eviction version info.
    - one normal 4K page in memory, to store encrypted content of EPC
      page.
    - one struct PCMD in memory, to store meta data.

And below is the sequence to evict an SECS page or VA page:

    1) Locate SECS (or VA) page
    2) Remove EPT/PT mapping for SECS (or VA) page
    3) Send IPIs to remote CPUs
    4) Allocate one available slot (8-byte) in VA page
    5) EWB on SECS (or VA) page

To evict an SECS page, all regular EPC pages that belong to that SECS
must be evicted first, otherwise EWB returns the SGX_CHILD_PRESENT error.

And to reload an EPC page:

    1) ELDU/ELDB on EPC page
    2) Set up EPT/PT mapping

With ELDU/ELDB taking:

    - location of SECS page
    - linear address of enclave's 4K page (that we are going to reload
      to)
    - VA slot (used in EWB)
    - 4K page in memory (used in EWB)
    - struct PCMD in memory (used in EWB)

Therefore, to support EPC oversubscription for guests, KVM needs to know:

    1) EPC page type (SECS, regular page, VA page, etc)
    2) EPC status (whether blocked) -- guest may already have run EBLOCK
    3) location of SECS page -- both eviction & reload need it.

Besides the above, KVM also needs to manage allocation of VA slots. VA
slots live in VA pages, which are themselves EPC pages, so allocating
them could itself trigger EPC oversubscription.

To get the above info, KVM needs to trap ENCLS from all guests, and
maintain info on all EPC pages and all enclaves from all guests.
Specifically, KVM needs to turn on ENCLS VMEXIT for all guests, and upon
ENCLS VMEXIT, KVM needs to parse the ENCLS parameters (so that we can
update EPC/enclave info according to which ENCLS leaf the guest is
running, and its parameters).
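
To give a feel for the bookkeeping this implies, below is a rough
userspace sketch (not part of this series, which does not implement
oversubscription). struct epc_page_info, struct va_page, va_alloc_slot()
and epc_page_evictable() are all hypothetical names, and the enum values
are arbitrary rather than hardware encodings. It only models two facts
from the text above: a 4K VA page holds 512 slots of 8 bytes, and an SECS
page cannot be written back while it still has child pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EPC_PAGE_SIZE	4096
#define VA_SLOT_SIZE	8
#define VA_SLOTS	(EPC_PAGE_SIZE / VA_SLOT_SIZE)	/* 512 slots per VA page */

/* Arbitrary illustrative values, not the EPCM encodings. */
enum epc_page_type { EPC_PT_SECS, EPC_PT_TCS, EPC_PT_REG, EPC_PT_VA, EPC_PT_TRIM };

/* Hypothetical per-page record, filled in by parsing trapped ENCLS leaves. */
struct epc_page_info {
	enum epc_page_type type;
	bool blocked;			/* guest may already have run EBLOCK */
	struct epc_page_info *secs;	/* owning SECS, NULL for SECS/VA pages */
	unsigned int children;		/* SECS only: pages not yet evicted */
};

/* Hypothetical allocation bitmap for the slots of one VA page. */
struct va_page {
	uint64_t used[VA_SLOTS / 64];
};

/* Returns the index of a newly reserved slot, or -1 if the VA page is full. */
static int va_alloc_slot(struct va_page *va)
{
	assert(va != NULL);

	for (int i = 0; i < VA_SLOTS; i++) {
		uint64_t mask = 1ULL << (i % 64);

		if (!(va->used[i / 64] & mask)) {
			va->used[i / 64] |= mask;
			return i;
		}
	}
	return -1;
}

/* EWB on an SECS page fails with SGX_CHILD_PRESENT while children remain. */
static bool epc_page_evictable(const struct epc_page_info *page)
{
	return page->type != EPC_PT_SECS || page->children == 0;
}
```

When va_alloc_slot() returns -1 a new VA page has to be allocated from
EPC, which is exactly the case where VA management itself adds EPC
pressure.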

KVM also needs to either run ENCLS on behalf of the guest (and skip this
ENCLS), or use MTF to return to the guest and let the guest run this
ENCLS again. For the former, KVM needs to reconstruct the guest's ENCLS
parameters and remap guest virtual addresses to KVM kernel addresses (as
all addresses in the guest's ENCLS parameters are guest virtual
addresses), and run ENCLS in KVM on behalf of the guest. For the latter,
upon ENCLS VMEXIT, KVM needs to temporarily turn off ENCLS VMEXIT, turn
on MTF VMEXIT, and enter the guest to allow it to run this ENCLS again.
This time ENCLS VMEXIT won't happen, and MTF VMEXIT will happen after
ENCLS is executed. Upon MTF VMEXIT, we turn on ENCLS VMEXIT and turn off
MTF VMEXIT again.

The diagrams below compare the two approaches: Run ENCLS in KVM, and
Using MTF.

    --------------------------------------------------------------
                    |       ENCLS           |
    --------------------------------------------------------------
                    |                      /|\
       ENCLS VMEXIT |                       | VMENTRY
                    |                       |
                   \|/                      |

        1) parse ENCLS parameters
        2) reconstruct (remap) guest's ENCLS parameters
        3) run ENCLS on behalf of guest (and skip ENCLS)
        4) on success, update EPC/enclave info, or inject error

                    1) Run ENCLS in KVM

    --------------------------------------------------------------
            |           ENCLS                 |
    --------------------------------------------------------------
            |   /|\                         |/|\
    ENCLS   |    | VMENTRY          MTF     | | VMENTRY
    VMEXIT  |    |                  VMEXIT  | |
           \|/   |                         \|/|

    1) Turn off ENCLS VMEXIT            1) Turn off MTF VMEXIT
    2) Turn on MTF VMEXIT               2) Turn on ENCLS VMEXIT
    3) Cache ENCLS parameters           3) Check whether ENCLS succeeded,
       (as ENCLS will change               and only on success parse the
       RAX-RDX)                            cached ENCLS parameters and
                                           update EPC/enclave info

                    2) Using MTF

Note that with MTF, checking ENCLS status (whether it succeeded or not)
is tricky, as ENCLS can either return an error via the EAX register or
just cause #UD or #GP. The former case is relatively easy for KVM to
check, but for the latter KVM needs to trap #UD and #GP from the guest,
and also needs to check whether the #UD or #GP happened while running
ENCLS.
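
For illustration, the MTF approach can be modelled as a small two-state
toggle. This is a standalone sketch, not code from this series:
struct vcpu_sketch, on_encls_vmexit() and on_mtf_vmexit() are invented
names, the 'faulted' argument stands in for whatever #UD/#GP detection
KVM would need, and treating EAX == 0 as success assumes an ENCLS leaf
that reports its status in EAX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-VCPU state for the "Using MTF" approach. */
struct vcpu_sketch {
	bool encls_exiting;	/* ENCLS VMEXIT currently enabled */
	bool mtf_exiting;	/* MTF VMEXIT currently enabled */
	bool have_cached;	/* cached[] holds parameters of a pending ENCLS */
	uint64_t cached[4];	/* RAX, RBX, RCX, RDX at ENCLS VMEXIT */
	unsigned int updates;	/* times EPC/enclave info was updated */
};

/* ENCLS VMEXIT: cache parameters, then let the guest re-run ENCLS under MTF. */
static void on_encls_vmexit(struct vcpu_sketch *v, const uint64_t regs[4])
{
	assert(v->encls_exiting && !v->mtf_exiting);

	memcpy(v->cached, regs, sizeof(v->cached));
	v->have_cached = true;
	v->encls_exiting = false;
	v->mtf_exiting = true;
}

/*
 * MTF VMEXIT after the guest executed ENCLS: restore the controls and
 * update tracking info only if ENCLS neither faulted nor returned an error.
 */
static void on_mtf_vmexit(struct vcpu_sketch *v, uint64_t rax_after, bool faulted)
{
	assert(v->mtf_exiting && v->have_cached);

	v->mtf_exiting = false;
	v->encls_exiting = true;

	if (!faulted && rax_after == 0)
		v->updates++;	/* would parse v->cached[] here */

	v->have_cached = false;
}
```

The parameters have to be cached on the first exit because, as noted in
the diagram, ENCLS overwrites RAX-RDX by the time the MTF exit arrives.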

In this patch series we only support 'static partitioning'.
'Oversubscription' can be supported when it is required. Currently we do
support nested SGX (mentioned below), and supporting nested SGX with
'oversubscription' will be very complicated.

- Guest's EPC memory slot implementation

The guest's (virtual) EPC is implemented as a private memory slot in KVM.
Qemu will not be aware of the existence of this EPC slot. Using a private
slot, we can avoid an mmap in Qemu to get the EPC slot's host virtual
address, and KVM doesn't need to handle such an mmap from Qemu for the
EPC slot. And we don't want to implement such mmap support in the SGX
driver either.

A dedicated kvm_epc_ops is added for the VMA of the EPC slot, and EPC
pages are allocated via vma->vm_ops->fault. This is the natural way to
support 'oversubscription' (if we need to support it in the future) and
works nicely for 'static partitioning' as well.

- Nested SGX

Currently nested SGX is also supported for 'static partitioning'. As
mentioned above, in the normal case KVM (L0) doesn't need to turn on
ENCLS VMEXIT, but KVM cannot make assumptions about the L1 hypervisor's
behavior, so if ENCLS VMEXIT is turned on in L1, KVM (L0) must also turn
on ENCLS VMEXIT, but let L1 handle such ENCLS VMEXITs from the L2 guest.

Supporting nested SGX with 'oversubscription' will be very complicated,
as both L0 and L1 may turn on ENCLS VMEXIT, and both L0 and L1 need to
maintain and update EPC/enclave info from guests as explained above.
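
As a rough illustration of the nested case (again a standalone sketch
with invented names, not the code in this series), the ENCLS-exiting
check from the SDM pseudocode quoted earlier can be combined with a
simple policy: the controls used while running L2 are the union of what
L0 and L1 ask for, and an exit that L1 asked for is reflected to L1.
struct encls_ctl, encls_merge() and encls_route_exit() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* "Enable ENCLS exiting" control plus the 64-bit ENCLS-exiting bitmap. */
struct encls_ctl {
	bool enable;
	uint64_t bitmap;
};

enum encls_route { ENCLS_NO_EXIT, ENCLS_TO_L0, ENCLS_TO_L1 };

/* Leaves 0..62 use their own bitmap bit, leaves 63 and above share bit 63. */
static bool encls_exits(const struct encls_ctl *c, uint32_t leaf)
{
	unsigned int bit = leaf < 63 ? leaf : 63;

	assert(c != NULL);
	return c->enable && ((c->bitmap >> bit) & 1);
}

/* Controls to use while L2 runs: exit if either L0 or L1 wants the exit. */
static struct encls_ctl encls_merge(const struct encls_ctl *l0,
				    const struct encls_ctl *l1)
{
	struct encls_ctl merged = {
		.enable = l0->enable || l1->enable,
		.bitmap = (l0->enable ? l0->bitmap : 0) |
			  (l1->enable ? l1->bitmap : 0),
	};

	return merged;
}

/* Decide who handles an ENCLS executed by L2 with the given leaf in EAX. */
static enum encls_route encls_route_exit(const struct encls_ctl *l0,
					 const struct encls_ctl *l1,
					 uint32_t leaf)
{
	if (encls_exits(l1, leaf))
		return ENCLS_TO_L1;	/* reflect the exit to the L1 hypervisor */
	if (encls_exits(l0, leaf))
		return ENCLS_TO_L0;
	return ENCLS_NO_EXIT;
}
```

With static partitioning L0 normally leaves its own control disabled, so
in this model every ENCLS exit taken while L2 runs is one that L1 asked
for.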

Kai Huang (10):
  x86: add SGX Launch Control definition to cpufeature
  kvm: vmx: add ENCLS VMEXIT detection
  kvm: vmx: detect presence of host SGX driver
  kvm: sgx: new functions to init and destory SGX for guest
  kvm: x86: add KVM_GET_SUPPORTED_CPUID SGX support
  kvm: x86: add KVM_SET_CPUID2 SGX support
  kvm: vmx: add SGX IA32_FEATURE_CONTROL MSR emulation
  kvm: vmx: add guest's IA32_SGXLEPUBKEYHASHn runtime switch support
  kvm: vmx: handle ENCLS VMEXIT
  kvm: vmx: handle VMEXIT from SGX Enclave

 arch/x86/include/asm/cpufeatures.h |   1 +
 arch/x86/include/asm/kvm_host.h    |   9 +-
 arch/x86/include/asm/msr-index.h   |   7 +
 arch/x86/include/asm/vmx.h         |   4 +
 arch/x86/include/uapi/asm/vmx.h    |   5 +-
 arch/x86/kvm/Makefile              |   2 +-
 arch/x86/kvm/cpuid.c               |  21 +-
 arch/x86/kvm/cpuid.h               |  22 ++
 arch/x86/kvm/sgx.c                 | 463 +++++++++++++++++++++++
 arch/x86/kvm/sgx.h                 | 105 ++++++
 arch/x86/kvm/svm.c                 |  11 +-
 arch/x86/kvm/vmx.c                 | 752 +++++++++++++++++++++++++++++++++++--
 12 files changed, 1362 insertions(+), 40 deletions(-)
 create mode 100644 arch/x86/kvm/sgx.c
 create mode 100644 arch/x86/kvm/sgx.h

-- 
2.11.0