Hi Sean, Thanks for having a look. On Wed, Jul 06, 2022 at 07:17:29PM +0000, Sean Christopherson wrote: > On Thu, Jun 30, 2022, Will Deacon wrote: > > This series has been extracted from the pKVM base support series (aka > > "pKVM mega-patch") previously posted here: > > > > https://lore.kernel.org/kvmarm/20220519134204.5379-1-will@xxxxxxxxxx/ > > > > Unlike that more comprehensive series, this one is fairly fundamental > > and does not introduce any new ABI commitments, leaving questions > > involving the management of guest private memory and the creation of > > protected VMs for future work. Instead, this series extends the pKVM EL2 > > code so that it can dynamically instantiate and manage VM shadow > > structures without the host being able to access them directly. These > > shadow structures consist of a shadow VM, a set of shadow vCPUs and the > > stage-2 page-table and the pages used to hold them are returned to the > > host when the VM is destroyed. > > > > The last patch is marked as RFC because, although it plumbs in the > > shadow state, it is woefully inefficient and copies to/from the host > > state on every vCPU run. Without the last patch, the new structures are > > unused but we move considerably closer to isolating guests from the > > host. > > ... > > > arch/arm64/include/asm/kvm_asm.h | 6 +- > > arch/arm64/include/asm/kvm_host.h | 65 +++ > > arch/arm64/include/asm/kvm_hyp.h | 3 + > > arch/arm64/include/asm/kvm_pgtable.h | 8 + > > arch/arm64/include/asm/kvm_pkvm.h | 38 ++ > > arch/arm64/kernel/image-vars.h | 15 - > > arch/arm64/kvm/arm.c | 40 +- > > arch/arm64/kvm/hyp/hyp-constants.c | 3 + > > arch/arm64/kvm/hyp/include/nvhe/gfp.h | 6 +- > > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 19 +- > > arch/arm64/kvm/hyp/include/nvhe/memory.h | 26 +- > > arch/arm64/kvm/hyp/include/nvhe/mm.h | 18 +- > > arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 70 +++ > > arch/arm64/kvm/hyp/include/nvhe/spinlock.h | 10 +- > > arch/arm64/kvm/hyp/nvhe/cache.S | 11 + > > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 105 +++- > > arch/arm64/kvm/hyp/nvhe/hyp-smp.c | 2 + > > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 456 +++++++++++++++++- > > arch/arm64/kvm/hyp/nvhe/mm.c | 136 +++++- > > arch/arm64/kvm/hyp/nvhe/page_alloc.c | 42 +- > > arch/arm64/kvm/hyp/nvhe/pkvm.c | 438 +++++++++++++++++ > > arch/arm64/kvm/hyp/nvhe/setup.c | 96 ++-- > > arch/arm64/kvm/hyp/pgtable.c | 9 + > > arch/arm64/kvm/mmu.c | 26 + > > arch/arm64/kvm/pkvm.c | 121 ++++- > > 25 files changed, 1625 insertions(+), 144 deletions(-) > > create mode 100644 arch/arm64/kvm/hyp/include/nvhe/pkvm.h > > The lack of documentation and the rather terse changelogs make this really hard > to review for folks that aren't intimately familiar with pKVM. I have a decent > idea of the end goal of "shadowing", but that's mostly because of my involvement in > similar x86 projects. Nothing in the changelogs ever explains _why_ pKVM uses > shadows. That's understandable, but thanks for persevering; this series is pretty down in the murky depths of the arm64 architecture and EL2 code so it doesn't really map to the KVM code most folks are familiar with. It's fair to say we're assuming a lot of niche prior knowledge (which is quite common for arch code in my experience), but I wanted to inherit the broader cc list so you were aware of this break-away series. Sadly, I don't think beefing up the commit messages would get us to a point where somebody unfamiliar with the EL2 code already could give a constructive review, but we can try to expand them a bit if you genuinely think it would help. On the more positive side, we'll be speaking at KVM forum about what we've done here, so that will be a great place to discuss it further and then we can also link back to the recordings in later postings of the mega-series. > I put "shadowing" in quotes because if the unstrusted host is aware that the VM > and vCPU it is manipulating aren't the "real" VMs/vCPUs, and there is an explicit API > between the untrusted host and pKVM for creating/destroying VMs/vCPUs, then I would > argue that it's not truly shadowing, especially if pKVM uses data/values verbatim > and only verifies correctness/safety. It's definitely a nit, but for future readers > I think overloading "shadowing" could be confusing. Ah, this is really interesting and nicely puts the ball back in my court as I'm not well versed with x86's use of "shadowing". We should probably use another name (ideas?), but our "shadow" is very much explicit -- rather than the host using its 'struct kvm's and 'struct kvm_vcpu's directly, it instead passes those data structures to the hypervisor which allocates its own structures (in some cases reusing the host definitions directly, e.g. 'struct kvm_vcpu') and returning a handle to the host as a reference. The host can then issue hypercalls with this handle to load/put vCPUs of that VM, run them once they are loaded and synchronise aspects of the vCPU state between the host and the hypervisor copies for things like emulation traps or interrupt injection. The main thing is that the pages containing the hypervisor structures are not accessible by the host until the corresponding VM is destroyed. The advantage of doing it this way is that we don't need to change very much of the host KVM code at all, and we can even reuse some of it directly in the hypervisor (e.g. inline functions and macros). Perhaps we should s/shadow/hyp/ to make this a little clearer? > And beyond the basics, IMO pKVM needs a more formal definition of exactly what > guest state is protected/hidden from the untrusted host. Peeking at the mega series, > there are a huge pile of patches that result in "gradual reduction of EL2 trust in > host data", but I couldn't any documentation that defines what that end result is. That's fair; I'll work to extend the documentation in the next iteration of the mega series to cover this in more detail. Roughly speaking, the end result is that the vCPU register and memory state is inaccessible to the host except in cases where the guest has done something to expose it such as MMIO or a memory sharing hypercall. Given the complexity of the register state (GPRs, floating point, SIMD, debug, etc) the mega-series elavates portions of the state from the host to the hypervisor as separate patches to structure things a bit better (that's where the gradual reduction comes in). Does that help at all? Cheers, Will