Hi all,

This RFC series provides the infrastructure needed to wrap the host
kernel with a stage 2 when running KVM in nVHE. This can be useful for
several use-cases, but the primary motivation is to (eventually) be able
to protect guest memory from the host kernel. More details about the
overall idea, design, and motivations can be found in Will's talk at KVM
Forum 2020 [1], or the pKVM talk at the Android microconference during
LPC 2020 [2].

This series essentially gets us to a point where the 'VM' bit is set in
the host's HCR_EL2 when running in nVHE with 'kvm-arm.protected' set on
the kernel command line. The EL2 object directly handles memory aborts
from the host and fully manages its stage 2 page-table. However, this
series does _not_ provide any real user for this (yet) and simply idmaps
everything into the host stage 2 as RWX cacheable. This is all about the
infrastructure for now, so clearly not ready for inclusion upstream yet
(hence the RFC tag), but the foundations are there and I thought it'd be
useful to start a discussion with the community early as this is a
rather intrusive change. So, here goes.

One of the interesting requirements that comes with the series is that
managing page-tables requires some sort of memory allocator at EL2 to
allocate, refcount and free memory pages. Clearly, none of that is
currently possible in nVHE, so a significant chunk of the series is
dedicated to solving that problem. The proposed EL2 memory allocator
mimics Linux's buddy system in principle, and re-uses some of the arm64
mm design choices. Specifically, it uses a vmemmap at EL2 which contains
a set of struct hyp_page entries to hold page metadata. To support this,
I extended the EL2 object to make it manage its own stage 1 page-table
in addition to the host stage 2.
This simplifies the hyp_vmemmap creation and was going to be required
anyway for the protected VM use-case: the threat model implies the host
cannot be trusted after boot, so it will be crucial to ensure it cannot
map arbitrary code at EL2.

The pool of memory pages used by the EL2 allocator is reserved by the
host early during boot (while it is still trusted) using the memblock
API, and is donated to EL2 during KVM init. The current assumption is
that the host reserves enough pages to allow the EL2 object to map all
of memory at page granularity for both hyp stage 1 and host stage 2,
plus some extra pages for device mappings.

On top of that, the series introduces a few smaller features that are
needed along the way, but hopefully all of those are detailed properly
in the relevant commit messages.

And as a last note, I'd like to point out that at this point there are
still trivial ways for the host to circumvent its stage 2 protection. It
still owns the guests' stage 2 for example, meaning that nothing would
prevent a malicious host from using a guest as a proxy to access
protected memory, _yet_. This series lays the ground for future work to
address these things, which will clearly require a stage 2 over the host
at some point, so I just wanted to set the expectations right.

With all that in mind, the series is organized as follows:

 - patches 01-03 provide EL2 with some utility libraries needed for
   memory management and synchronization;

 - patches 04-09 mostly refactor small portions of the code to ease the
   EL2 memory management;

 - patches 10-17 add the actual EL2 memory management code, as well as
   the setup/bootstrap code on the KVM init path;

 - patches 18-24 refactor the existing stage 2 management code to make
   it re-usable from the EL2 object;

 - and finally patches 25-27 introduce the host stage 2 and the trap
   handling logic at EL2.
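
To make the allocator description above a bit more concrete, here is a
minimal userspace sketch of a buddy-style page allocator that keeps its
per-page metadata in a vmemmap-like array. It is an illustration only
and not code from the series: struct sketch_page, sketch_pool_init(),
sketch_alloc_pages() and sketch_put_page() are hypothetical names, the
pool is a toy 16-page range, and the real struct hyp_page layout and
EL2 locking are not represented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ORDER 4                     /* toy pool: 2^4 = 16 pages */
#define NR_PAGES  (1u << MAX_ORDER)

/* Hypothetical per-page metadata; the series' struct hyp_page may differ. */
struct sketch_page {
	unsigned int refcount;          /* users of the block headed by this page */
	unsigned int order;             /* block size is 2^order pages */
	bool free;                      /* true only for the head of a free block */
	struct sketch_page *next;       /* link in free_area[order] */
};

/* "vmemmap": one metadata entry per page, indexed by page frame number. */
static struct sketch_page vmemmap[NR_PAGES];
static struct sketch_page *free_area[MAX_ORDER + 1];

static size_t page_to_pfn(const struct sketch_page *p)
{
	return (size_t)(p - vmemmap);
}

/* Mark p as the head of a free block of the given order and queue it. */
static void list_push(struct sketch_page *p, unsigned int order)
{
	p->order = order;
	p->free = true;
	p->next = free_area[order];
	free_area[order] = p;
}

/* Unlink a free block head from the free list matching its order. */
static void list_remove(struct sketch_page *p)
{
	struct sketch_page **link = &free_area[p->order];

	while (*link != p)
		link = &(*link)->next;
	*link = p->next;
	p->free = false;
}

/* Start with the whole pool as a single free block of MAX_ORDER. */
static void sketch_pool_init(void)
{
	for (size_t i = 0; i < NR_PAGES; i++)
		vmemmap[i] = (struct sketch_page){ 0 };
	for (unsigned int o = 0; o <= MAX_ORDER; o++)
		free_area[o] = NULL;
	list_push(&vmemmap[0], MAX_ORDER);
}

/* Take the smallest suitable free block and split off unused halves. */
static struct sketch_page *sketch_alloc_pages(unsigned int order)
{
	unsigned int o = order;
	struct sketch_page *p;

	while (o <= MAX_ORDER && !free_area[o])
		o++;
	if (o > MAX_ORDER)
		return NULL;

	p = free_area[o];
	list_remove(p);
	while (o > order) {
		o--;
		list_push(p + (1u << o), o);    /* upper half becomes free */
	}
	p->order = order;
	p->refcount = 1;
	return p;
}

/* Drop a reference; on the last one, merge with free buddies and queue. */
static void sketch_put_page(struct sketch_page *p)
{
	unsigned int order = p->order;

	assert(p->refcount > 0);
	if (--p->refcount)
		return;

	while (order < MAX_ORDER) {
		/* The buddy's pfn differs only in bit 'order'. */
		struct sketch_page *buddy = &vmemmap[page_to_pfn(p) ^ (1u << order)];

		if (!buddy->free || buddy->order != order)
			break;
		list_remove(buddy);
		if (buddy < p)
			p = buddy;
		order++;
	}
	list_push(p, order);
}
```

The point of the sketch is that the metadata array alone is enough to
find a block's buddy (flip one bit of the page frame number) and to
decide whether the two can be merged, which is why a vmemmap of page
metadata is convenient for this kind of allocator.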

This work is based on the latest kvmarm/queue (which includes Marc's
host EL2 entry rework [3], as well as Will's guest vector refactoring
[4]) + David's PSCI proxying series [5]. And if you'd like a branch that
has all the bits and pieces:

  https://android-kvm.googlesource.com/linux qperret/host-stage2

Boot-tested (host and guest) using qemu in VHE and nVHE, and on real
hardware on an AML-S905X-CC (Le Potato).

Thanks,
Quentin

[1] https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google
[2] https://youtu.be/54q6RzS9BpQ?t=10859
[3] https://lore.kernel.org/kvmarm/20201109175923.445945-1-maz@xxxxxxxxxx/
[4] https://lore.kernel.org/kvmarm/20201113113847.21619-1-will@xxxxxxxxxx/
[5] https://lore.kernel.org/kvmarm/20201116204318.63987-1-dbrazdil@xxxxxxxxxx/

Quentin Perret (24):
  KVM: arm64: Initialize kvm_nvhe_init_params early
  KVM: arm64: Avoid free_page() in page-table allocator
  KVM: arm64: Factor memory allocation out of pgtable.c
  KVM: arm64: Introduce a BSS section for use at Hyp
  KVM: arm64: Make kvm_call_hyp() a function call at Hyp
  KVM: arm64: Allow using kvm_nvhe_sym() in hyp code
  KVM: arm64: Introduce an early Hyp page allocator
  KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp
  KVM: arm64: Introduce a Hyp buddy page allocator
  KVM: arm64: Enable access to sanitized CPU features at EL2
  KVM: arm64: Factor out vector address calculation
  of/fdt: Introduce early_init_dt_add_memory_hyp()
  KVM: arm64: Prepare Hyp memory protection
  KVM: arm64: Elevate Hyp mappings creation at EL2
  KVM: arm64: Use kvm_arch for stage 2 pgtable
  KVM: arm64: Use kvm_arch in kvm_s2_mmu
  KVM: arm64: Set host stage 2 using kvm_nvhe_init_params
  KVM: arm64: Refactor kvm_arm_setup_stage2()
  KVM: arm64: Refactor __load_guest_stage2()
  KVM: arm64: Refactor __populate_fault_info()
  KVM: arm64: Make memcache anonymous in pgtable allocator
  KVM: arm64: Reserve memory for host stage 2
  KVM: arm64: Sort the memblock regions list
  KVM: arm64: Wrap the host with a stage 2

Will Deacon (3):
  arm64: lib: Annotate {clear,copy}_page() as position-independent
  KVM: arm64: Link position-independent string routines into .hyp.text
  KVM: arm64: Add standalone ticket spinlock implementation for use at
    hyp

 arch/arm64/include/asm/cpufeature.h           |   1 +
 arch/arm64/include/asm/hyp_image.h            |   4 +
 arch/arm64/include/asm/kvm_asm.h              |  13 +-
 arch/arm64/include/asm/kvm_cpufeature.h       |  19 ++
 arch/arm64/include/asm/kvm_host.h             |  17 +-
 arch/arm64/include/asm/kvm_hyp.h              |   8 +
 arch/arm64/include/asm/kvm_mmu.h              |  69 +++++-
 arch/arm64/include/asm/kvm_pgtable.h          |  41 +++-
 arch/arm64/include/asm/sections.h             |   1 +
 arch/arm64/kernel/asm-offsets.c               |   3 +
 arch/arm64/kernel/cpufeature.c                |  14 +-
 arch/arm64/kernel/image-vars.h                |  35 +++
 arch/arm64/kernel/vmlinux.lds.S               |   7 +
 arch/arm64/kvm/arm.c                          | 136 +++++++++--
 arch/arm64/kvm/hyp/Makefile                   |   2 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h       |  36 +--
 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h |  14 ++
 arch/arm64/kvm/hyp/include/nvhe/gfp.h         |  32 +++
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  33 +++
 arch/arm64/kvm/hyp/include/nvhe/memory.h      |  55 +++++
 arch/arm64/kvm/hyp/include/nvhe/mm.h          | 107 +++++++++
 arch/arm64/kvm/hyp/include/nvhe/spinlock.h    |  95 ++++++++
 arch/arm64/kvm/hyp/include/nvhe/util.h        |  25 ++
 arch/arm64/kvm/hyp/nvhe/Makefile              |   9 +-
 arch/arm64/kvm/hyp/nvhe/cache.S               |  13 ++
 arch/arm64/kvm/hyp/nvhe/cpufeature.c          |   8 +
 arch/arm64/kvm/hyp/nvhe/early_alloc.c         |  60 +++++
 arch/arm64/kvm/hyp/nvhe/hyp-init.S            |  39 ++++
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  50 ++++
 arch/arm64/kvm/hyp/nvhe/hyp.lds.S             |   1 +
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 191 ++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/mm.c                  | 175 ++++++++++++++
 arch/arm64/kvm/hyp/nvhe/page_alloc.c          | 185 +++++++++++++++
 arch/arm64/kvm/hyp/nvhe/psci-relay.c          |   7 +-
 arch/arm64/kvm/hyp/nvhe/setup.c               | 214 ++++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/stub.c                |  22 ++
 arch/arm64/kvm/hyp/nvhe/switch.c              |  12 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c                 |   4 +-
 arch/arm64/kvm/hyp/pgtable.c                  |  98 ++++----
 arch/arm64/kvm/hyp/reserved_mem.c             |  95 ++++++++
 arch/arm64/kvm/mmu.c                          | 114 +++++++++-
 arch/arm64/kvm/reset.c                        |  42 +---
 arch/arm64/lib/clear_page.S                   |   4 +-
 arch/arm64/lib/copy_page.S                    |   4 +-
 arch/arm64/mm/init.c                          |   3 +
 drivers/of/fdt.c                              |   5 +
 46 files changed, 1971 insertions(+), 151 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_cpufeature.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/memory.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mm.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/util.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
 create mode 100644 arch/arm64/kvm/hyp/nvhe/cpufeature.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/early_alloc.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/mm.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/setup.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
 create mode 100644 arch/arm64/kvm/hyp/reserved_mem.c

-- 
2.29.2.299.gdc1121823c-goog