On Mon, Feb 26, 2024, Paolo Bonzini wrote:
> On Mon, Feb 26, 2024 at 3:34 PM Lai Jiangshan <jiangshanlai@xxxxxxxxx> wrote:
> > - Full control: In XENPV/Lguest, the host Linux (dom0) entry code is
> >   subordinate to the hypervisor/switcher, and the host Linux kernel
> >   loses control over the entry code. This can cause inconvenience if
> >   there is a need to update something when there is a bug in the
> >   switcher or hardware. Integral entry gives the control back to the
> >   host kernel.
> >
> > - Zero overhead incurred: The integrated entry code doesn't cause any
> >   overhead in host Linux entry path, thanks to the discreet design with
> >   PVM code in the switcher, where the PVM path is bypassed on host events.
> >   While in XENPV/Lguest, host events must be handled by the
> >   hypervisor/switcher before being processed.
>
> Lguest... Now that's a name I haven't heard in a long time. :) To be
> honest, it's a bit weird to see yet another PV hypervisor. I think
> what really killed Xen PV was the impossibility to protect from
> various speculation side channel attacks, and I would like to
> understand how PVM fares here.
>
> You obviously did a great job in implementing this within the KVM
> framework; the changes in arch/x86/ are impressively small. On the
> other hand this means it's also not really my call to decide whether
> this is suitable for merging upstream. The bulk of the changes are
> really in arch/x86/kernel/ and arch/x86/entry/, and those are well
> outside my maintenance area.

The bulk of changes in _this_ patchset are outside of arch/x86/kvm, but
there are more changes on the horizon:

 : To mitigate the performance problem, we designed several optimizations
 : for the shadow MMU (not included in the patchset) and also planning to
 : build a shadow EPT in L0 for L2 PVM guests.

 : - Parallel Page fault for SPT and Paravirtualized MMU Optimization.
And even absent _new_ shadow paging functionality, merging PVM would
effectively shatter any hopes of ever removing KVM's existing, complex
shadow paging code.  Specifically, unsync 4KiB PTE support in KVM
provides almost no benefit for nested TDP.  So if we can ever drop
support for legacy shadow paging, which is a big if, but not completely
impossible, then we could greatly simplify KVM's shadow MMU.

Which is a good segue into my main question: was there any one thing
that was _the_ motivating factor for taking on the cost+complexity of
shadow paging?  And, as alluded to by Paolo, taking on the downsides of
reduced isolation?  It doesn't seem like avoiding L0 changes was the
driving decision, since IIUC you have plans to make changes there as
well.

 : To mitigate the performance problem, we designed several optimizations
 : for the shadow MMU (not included in the patchset) and also planning to
 : build a shadow EPT in L0 for L2 PVM guests.

Performance I can kinda sorta understand, but my gut feeling is that the
problems with nested virtualization are solvable by adding nested
paravirtualization between L0<=>L1, with likely lower overall
cost+complexity than paravirtualizing L1<=>L2.

The bulk of the pain with nested hardware virtualization lies in having
to emulate VMX/SVM, and shadow L1's TDP page tables.  Hyper-V's eVMCS
takes some of the sting off nVMX in particular, but eVMCS is still
hobbled by its desire to be almost drop-in compatible with VMX.

If we're willing to define a fully PV interface between L0 and L1
hypervisors, I suspect we could provide performance far, far better than
nVMX/nSVM.  E.g. if L0 provides a hypercall to map an L2=>L1 GPA, then
L0 doesn't need to shadow L1's TDP tables, and L1 doesn't even need to
maintain hardware-defined page tables; it can use whatever
software-defined data structure best fits its needs.
And if we limit support to 64-bit L2 kernels and drop support for
unnecessary cruft, the L1<=>L2 entry/exit paths could be drastically
simplified and streamlined.  And it should be very doable to concoct an
ABI between L0 and L2 that allows L0 to directly emulate "hot"
instructions from L2, e.g. CPUID, common MSRs, etc.  I/O would likely be
solvable too, e.g. maybe with a mediated device type solution that
allows L0 to handle the data path for L2?

The one thing that I don't see line of sight to supporting is taking L0
out of the TCB, i.e. running L2 VMs inside TDX/SNP guests.  But for me
at least, that alone isn't sufficient justification for adding a PV
flavor of KVM.