[RFC PATCH part-1 0/5] pKVM on Intel Platform Introduction

Protected-KVM (pKVM) on the Intel platform is designed as a thin
hypervisor that extends KVM to support VMs isolated from the host.

I am sending out this RFC to request early review. The patches are at
an early stage and carry a large LOC, but I hope they present the
basic idea of pKVM on the Intel platform and give you an overview of
the most fundamental changes needed to launch VMs on top of the pKVM
hypervisor. The patches are ultimately intended to be sliced into more
digestible pieces for review and merge.

The concept of pKVM was first introduced by Google for the ARM
platform [1][2][3]. It aims to extend the Trusted Execution
Environment (TEE) from the ARM secure world to virtual machines (VMs).
Such VMs are protected by pKVM from the host OS or other VMs accessing
the payloads running inside them (hence called protected VMs). More
details about the overall idea, design, and motivations can be found
in Will's talk at KVM Forum 2020 [4].

There are similar use cases on x86 platforms that require a protected
environment, isolated from the host OS, for confidential computing.
Meanwhile, the host OS still presents the primary user interface, and
people expect the same bare-metal experience as before in terms of
both performance and functionality (such as rich-IO usages), so the
host OS should retain the ability to manage system resources as much
as possible. At the same time, in order to mitigate attacks on the
confidential computing environment, the Trusted Computing Base (TCB)
of the solution should be minimized.

Hardware solutions, e.g. TDX [5], also exist to support the above use
cases, but they are available only on very new platforms. Hence a
software solution for the large base of existing platforms is also
desirable.

pKVM has the merit of both providing an isolated environment for
protected VMs and sustaining the rich bare-metal experience expected
by the host OS. This is achieved by creating a small hypervisor below
the host OS which contains only the minimal functionality (e.g. VMX,
EPT, IOMMU, etc.) needed to isolate protected VMs from the host OS and
other VMs. In the meantime, the host kernel retains access to most of
the system resources and plays the role of managing VM life cycles,
allocating VM resources, etc. The existing KVM module calls into the
hypervisor (via emulation or enlightened PV ops) to complete the
missing functionality which has been moved downward.

      +--------------------+   +-----------------+
      |                    |   |                 |
      |     host VM        |   |  protected VM   |
      |    (act like       |   |                 |
      |   on bare metal)   |   |                 |
      |                    |   +-----------------+
      |                    +---------------------+
      |            +--------------------+        |
      |            | vVMX, vEPT, vIOMMU |        |
      |            +--------------------+        |
      +------------------------------------------+
      +------------------------------------------+
      |       pKVM (own VMX, EPT, IOMMU)         |
      +------------------------------------------+

[note: the figure above uses Intel terminology]
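As a rough illustration of the "KVM calls into the hypervisor" path
described above, an enlightened PV op from the deprivileged KVM module
could look like the sketch below. The hypercall number and helper name
are hypothetical and not the ABI defined by this series:

  /*
   * Hypothetical sketch: a PV op from the deprivileged KVM module
   * down into pKVM. A VMCALL executed in VMX non-root mode causes a
   * VM exit into pKVM, which runs in VMX root mode. Names and ABI
   * are illustrative only.
   */
  #define PKVM_HC_EXAMPLE_OP  1UL  /* hypothetical hypercall number */

  static inline long pkvm_hypercall2(unsigned long nr,
                                     unsigned long a0, unsigned long a1)
  {
      long ret;

      asm volatile("vmcall"
                   : "=a" (ret)
                   : "a" (nr), "b" (a0), "c" (a1)
                   : "memory");
      return ret;
  }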

The terminology used in this RFC series:

- host VM:      native Linux which boots pKVM and is then deprivileged
                into a VM
- protected VM: VM launched by the host but protected by pKVM
- normal VM:    VM launched & protected by the host

The pKVM binary is compiled as an extension of the KVM module, but
resides in a separate, dedicated memory section of the vmlinux image.
This makes pKVM easy to release and to verified-boot together with the
Linux kernel image. It also means pKVM is a post-launched hypervisor,
since it is started by the KVM module.
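For illustration only, tagging code and data into such a dedicated
section could look like the sketch below; the section names and macros
are hypothetical, not the ones used by this series:

  /*
   * Hypothetical sketch: place pKVM text/data into dedicated
   * sections of the vmlinux image, so the hypervisor can later be
   * located and protected as one contiguous region. The section
   * names are illustrative only.
   */
  #define __pkvm_text __attribute__((__section__(".pkvm.text")))
  #define __pkvm_data __attribute__((__section__(".pkvm.data")))

  static int pkvm_initialized __pkvm_data;

  int __pkvm_text pkvm_init(void)
  {
      pkvm_initialized = 1;
      return 0;
  }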

The ARM platform naturally supports different exception levels (ELs),
and the host kernel can be set to run at EL1 during the early boot
stage before launching the pKVM hypervisor, so pKVM just needs to be
installed at EL2. On the Intel platform, the host Linux kernel
originally runs in VMX root mode and is then deprivileged into VMX
non-root mode as a host VM, whereas pKVM keeps running in VMX root
mode. Compared with pKVM on ARM, pKVM on the Intel platform needs a
more complicated deprivilege stage to prepare and set up the VMX
environment in VMX root mode.
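A very rough per-CPU sketch of that deprivilege stage might look as
follows; all function names are hypothetical placeholders for what
part-2 actually implements:

  /*
   * Hypothetical sketch of the per-CPU deprivilege flow: the host,
   * still in VMX root mode, enables VMX, builds a VMCS whose guest
   * state mirrors its own current state, then launches itself into
   * VMX non-root mode as the "host VM". Names are illustrative only.
   */
  static int pkvm_deprivilege_cpu(void)
  {
      int ret;

      ret = pkvm_enable_vmx();    /* set CR4.VMXE, execute VMXON */
      if (ret)
          return ret;

      /* guest state := current host state; host state := pKVM entry */
      pkvm_setup_vmcs_for_host_vm();

      /*
       * After a successful VMLAUNCH, execution resumes at the guest
       * RIP recorded above, but now in VMX non-root mode.
       */
      return pkvm_vmlaunch();
  }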

As a hypervisor, pKVM on the Intel platform leverages virtualization
technologies (see below) to guarantee the isolation between itself and
the lower-privilege guests (including host Linux) running on top of it:

 - pKVM manages the CPU state and context switches between the
   hypervisor and the different guests. This is largely done via the
   VMCS.

 - pKVM owns the EPT page tables to manage the GPA to HPA mappings of
   its host VM and guest VMs, which ensures they cannot touch the
   hypervisor's memory and are isolated from each other. This is
   similar to pKVM on ARM, which owns the stage-2 MMU page tables to
   isolate memory among the hypervisor, the host, protected VMs and
   normal VMs. To allow the host to manage EPT or stage-2 page tables,
   pKVM can provide either PV ops or emulation for these page tables.
   pKVM on ARM chose PV ops, providing hypervisor calls (HVCs) in pKVM
   for stage-2 MMU page table changes. pKVM on the Intel platform
   provides emulation for EPT page table management - this avoids code
   changes in the x86 KVM MMU (see the sketch after this list).

 - pKVM owns the IOMMU (VT-d on the Intel platform, SMMU on the ARM
   platform) to manage device DMA buffer mappings and isolate DMA
   accesses. To allow the host to manage IOMMU page tables, similar to
   EPT/stage-2 page table management, either PV ops or emulation can
   be chosen. pKVM on ARM chose PV ops [6], while pKVM on the Intel
   platform will use IOMMU emulation (this RFC does not cover it, and
   we are willing to change if we see more advantages in PV ops).
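To make the EPT emulation choice above more concrete, a simplified
exit-handler shape is sketched below; all names and types are
hypothetical, and the real shadow-EPT logic only arrives in part-6:

  /*
   * Hypothetical sketch of EPT emulation: host KVM keeps writing a
   * "virtual" EPT as usual; pKVM intercepts EPT violations, walks
   * that virtual EPT and mirrors validated entries into the shadow
   * EPT it actually owns. All names below are illustrative only.
   */
  static int pkvm_handle_ept_violation(struct pkvm_vm *vm, u64 gpa)
  {
      u64 hpa, prot;

      /* Walk the virtual EPT that host KVM manages for this VM */
      if (pkvm_walk_virtual_ept(vm, gpa, &hpa, &prot))
          return -EFAULT;

      /* Refuse translations that would expose pKVM-private memory */
      if (!pkvm_host_may_map(vm, hpa))
          return -EPERM;

      /* Install the validated translation into the shadow EPT */
      return pkvm_shadow_ept_map(vm, gpa, hpa, prot);
  }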

A talk at KVM Forum 2022 about supporting TEE on x86 client platforms
with pKVM [7] may help you understand more details about the framework
of pKVM on Intel platforms and the deltas between the Intel and ARM
implementations.

This RFC patch series is essential groundwork for future patch series.
Based on this RFC, the host OS is deprivileged and normal VMs can be
launched on top of the pKVM hypervisor. Following is the TODO list
after this series:

- protected VMs
   * page state management
   * security enforcement at vCPU context switch
   * QEMU & crosvm
   * fd-based proposal around KVM private memory [8]
   * guest attestation

- pass-thru devices
   * IOMMU virtualization

This RFC series is organized as follows:

  - Part-1 (this patch set) refactors small portions of the pKVM on
    ARM code to ease the support of pKVM on the Intel platform;

  - Part-2 introduces pKVM on the Intel platform and deprivileges the
    host OS, meanwhile building pKVM as an independent binary;

  - Part-3 introduces page-table management in pKVM on the Intel
    platform, then finally isolates pKVM from the host VM by creating
    its own address space (MMU + host EPT);

  - Part-4 contains misc changes to support VPID, debug and NMI
    handling in pKVM on the Intel platform;

  - Part-5 adds VMX emulation based on shadow VMCS;

  - Part-6 adds EPT emulation based on shadow EPT;

  - and finally Part-7 adds memory protection based on page state
    management.

This work is based on Linux 6.2, and you can also fetch the branch if
you would like:

  https://github.com/intel-staging/pKVM-IA/tree/RFC-v6.2

Thanks

Jason CJ Chen

[1]: https://lwn.net/Articles/836693/
[2]: https://lwn.net/Articles/837552/
[3]: https://lwn.net/Articles/895790/
[4]: https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google
[5]: https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
[6]: https://lore.kernel.org/linux-arm-kernel/20230201125328.2186498-1-jean-philippe@xxxxxxxxxx/T/
[7]: https://kvmforum2022.sched.com/event/15jKc/supporting-tee-on-x86-client-platforms-with-pkvm-jason-chen-intel
[8]: https://lwn.net/Articles/916589/

Jason Chen CJ (5):
  pkvm: arm64: Move nvhe/spinlock.h to include/asm dir
  pkvm: arm64: Make page allocator arch agnostic
  pkvm: arm64: Move page allocator to virt/kvm/pkvm
  pkvm: arm64: Make memory reservation arch agnostic
  pkvm: arm64: Move general part of memory reservation to virt/kvm/pkvm

 arch/arm64/include/asm/kvm_pkvm.h             |  8 ++
 .../asm/pkvm_spinlock.h}                      |  6 +-
 arch/arm64/kvm/Makefile                       |  3 +
 arch/arm64/kvm/hyp/hyp-constants.c            |  2 +-
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  2 +-
 arch/arm64/kvm/hyp/include/nvhe/mm.h          |  4 +-
 arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |  4 +-
 arch/arm64/kvm/hyp/nvhe/Makefile              |  4 +-
 arch/arm64/kvm/hyp/nvhe/early_alloc.c         |  2 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         |  4 +-
 arch/arm64/kvm/hyp/nvhe/mm.c                  |  6 +-
 arch/arm64/kvm/hyp/nvhe/pkvm.c                |  2 +-
 arch/arm64/kvm/hyp/nvhe/psci-relay.c          |  2 +-
 arch/arm64/kvm/hyp/nvhe/setup.c               |  4 +-
 arch/arm64/kvm/pkvm.c                         | 76 ++---------------
 .../memory.h => virt/kvm/pkvm/buddy_memory.h  | 10 +--
 .../hyp/include/nvhe => virt/kvm/pkvm}/gfp.h  | 10 +--
 .../hyp/nvhe => virt/kvm/pkvm}/page_alloc.c   |  3 +-
 virt/kvm/pkvm/pkvm.c                          | 84 +++++++++++++++++++
 19 files changed, 134 insertions(+), 102 deletions(-)
 rename arch/arm64/{kvm/hyp/include/nvhe/spinlock.h => include/asm/pkvm_spinlock.h} (95%)
 rename arch/arm64/kvm/hyp/include/nvhe/memory.h => virt/kvm/pkvm/buddy_memory.h (89%)
 rename {arch/arm64/kvm/hyp/include/nvhe => virt/kvm/pkvm}/gfp.h (86%)
 rename {arch/arm64/kvm/hyp/nvhe => virt/kvm/pkvm}/page_alloc.c (99%)
 create mode 100644 virt/kvm/pkvm/pkvm.c

-- 
2.25.1
