RE: [RFC PATCH part-5 00/22] VMX emulation

> -----Original Message-----
> From: Dmytro Maluka <dmy@xxxxxxxxxxxx>
> Sent: Friday, June 9, 2023 5:38 AM
> To: Chen, Jason CJ <jason.cj.chen@xxxxxxxxx>; Christopherson,, Sean
> <seanjc@xxxxxxxxxx>
> Cc: kvm@xxxxxxxxxxxxxxx; android-kvm@xxxxxxxxxx; Dmitry Torokhov
> <dtor@xxxxxxxxxxxx>; Tomasz Nowicki <tn@xxxxxxxxxxxx>; Grzegorz Jaszczyk
> <jaz@xxxxxxxxxxxx>; Keir Fraser <keirf@xxxxxxxxxx>
> Subject: Re: [RFC PATCH part-5 00/22] VMX emulation
> 
> On 3/14/23 17:29, Jason Chen CJ wrote:
> > On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
> >> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
> >>> This patch set is part-5 of this RFC patches. It introduces VMX
> >>> emulation for pKVM on Intel platform.
> >>>
> >>> If the host VM wants the capability to run its own guests, it needs VMX support.
> >>
> >> No, the host VM only needs a way to request pKVM to run a VM.  If we
> >> go down the rabbit hole of pKVM on x86, I think we should take the
> >> red pill[*] and go all the way down said rabbit hole by heavily paravirtualizing
> the KVM=>pKVM interface.
> >
> > hi, Sean,
> >
> > As I mentioned in my reply to "[RFC PATCH part-1 0/5] pKVM on
> > Intel Platform Introduction", we hope VMX emulation can be there at
> > least for normal VM support.
> >
> >>
> >> Except for VMCALL vs. VMMCALL, it should be possible to eliminate all
> >> traces of VMX and SVM from the interface.  That means no VMCS
> >> emulation, no EPT shadowing, etc.  As a bonus, any paravirt stuff we
> >> do for pKVM x86 would also be usable for KVM-on-KVM nested virtualization.
> >>
> >> E.g. an idea floating around my head is to add a paravirt paging
> >> interface for KVM-on-KVM so that L1's (KVM-high in this RFC) doesn't
> >> need to maintain its own TDP page tables.  I haven't pursued that
> >> idea in any real capacity since most nested virtualization use cases
> >> for KVM involve running an older L1 kernel and/or a non-KVM L1
> >> hypervisor, i.e. there's no concrete use case to justify the development and
> maintenance cost.  But if the PV code is "needed" by pKVM anyways...
> >
> > Yes, I agree, we could get performance and memory cost benefits by using
> > paravirt stuff for KVM-on-KVM nested virtualization. May I know whether I am
> > missing any other benefits you saw?
> 
> As I see it, the advantages of a PV design for pKVM are:
> 
> - performance
> - memory cost
> - code simplicity (of the pKVM hypervisor, first of all)
> - better alignment with the pKVM on ARM
> 
> Regarding performance, I actually suspect it may even be the least significant of
> the above. I guess with a PV design we'd have roughly as many extra vmexits as
> we have now (just due to hypercalls instead of traps on emulated VMX
> instructions etc), so perhaps the performance improvement would be not as big
> as we might expect (am I wrong?).

I think with a PV design we can benefit from skipping the shadowing. For example, a
TLB flush could be done in the hypervisor directly, whereas with shadow EPT we have to
emulate it by destroying the shadow EPT page table entries and then rebuilding them on
the next EPT violation.

With well-designed PV interfaces, I suppose we could also come up with a generic
design for nested support of KVM-on-hypervisor (e.g., do it first for KVM-on-KVM,
then extend it to KVM-on-pKVM and others).
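
To make the contrast concrete, below is a rough guest-side sketch of what such a PV
paging/TLB interface could look like. All names, hypercall numbers and the calling
convention are hypothetical, purely for illustration; this is not an existing pKVM
or KVM ABI.

	/*
	 * Hypothetical PV paging interface sketch -- names, hypercall numbers
	 * and calling convention are made up for illustration only.
	 */
	#include <stdint.h>

	#define PKVM_HC_MAP_GPA		0x1000	/* install gfn -> pfn in the VM's stage-2 */
	#define PKVM_HC_UNMAP_GPA	0x1001	/* remove gfn from the VM's stage-2 */
	#define PKVM_HC_FLUSH_TLB	0x1002	/* flush the VM's stage-2 TLB entries */

	/* Minimal vmcall wrapper in the style of kvm_hypercall3(). */
	static inline long pkvm_hypercall3(unsigned long nr, unsigned long a0,
					   unsigned long a1, unsigned long a2)
	{
		long ret;

		asm volatile("vmcall"
			     : "=a"(ret)
			     : "a"(nr), "b"(a0), "c"(a1), "d"(a2)
			     : "memory");
		return ret;
	}

	/*
	 * KVM-high asks the hypervisor to install or drop a translation
	 * directly, so there is no shadow EPT to tear down and re-fault;
	 * a TLB flush becomes a single hypercall instead of an
	 * invalidate-and-reshadow cycle.
	 */
	static inline long pkvm_map_gpa(unsigned long vm_handle, uint64_t gfn,
					uint64_t pfn)
	{
		return pkvm_hypercall3(PKVM_HC_MAP_GPA, vm_handle, gfn, pfn);
	}

	static inline long pkvm_flush_tlb(unsigned long vm_handle)
	{
		return pkvm_hypercall3(PKVM_HC_FLUSH_TLB, vm_handle, 0, 0);
	}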

> 
> But the memory cost advantage seems to be very attractive. With the emulated
> design pKVM needs to maintain shadow page tables (and other shadow
> structures too, but page tables are the most memory demanding). Moreover,
> the number of shadow page tables is obviously proportional to the number of
> VMs running, and since pKVM reserves all its memory upfront preparing for the
> worst case, we have pretty restrictive limits on the maximum number of VMs [*]
> (and if we run fewer VMs than this limit, we waste memory).
> 
> To give some numbers, on a machine with 8GB of RAM, on ChromeOS with this
> pKVM-on-x86 PoC currently we have pKVM memory cost of 229MB (and it only
> allows up to 10 VMs running simultaneously), while on Android (ARM) it is afaik
> only 44MB. According to my analysis, if we get rid of all the shadow tables in
> pKVM, we should have 44MB on x86 too (regardless of the maximum number of
> VMs).
> 
> [*] And some other limits too, e.g. on the maximum number of DMA-capable
> devices, since pKVM also needs shadow IOMMU page tables if we have only 1-
> stage IOMMU.

I may not have captured your meaning. Do you mean a device wants 2-stage translation
while we only have a 1-stage IOMMU? If so, I am not sure there is a real use case.

Per my understanding, for a PV IOMMU the simplest implementation is to just maintain
the 1-stage DMA mapping in the hypervisor, since a guest most likely only wants a
1-stage DMA mapping for its devices. If the IOMMU has nested capability and the guest
wants to use it (e.g., for vSVA), we can further extend the PV IOMMU interfaces.
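
For illustration, a minimal sketch of such a PV IOMMU interface on the host VM side
could look roughly like the following; the names, hypercall numbers and calling
convention are hypothetical, not an existing pKVM ABI.

	/*
	 * Hypothetical PV IOMMU interface sketch -- names, hypercall numbers
	 * and calling convention are made up for illustration only.
	 */
	#include <stdint.h>

	#define PKVM_HC_IOMMU_MAP	0x2000	/* install an iova -> pfn range for a device */
	#define PKVM_HC_IOMMU_UNMAP	0x2001	/* tear down an iova range for a device */

	/* vmcall wrapper in the style of kvm_hypercall4(). */
	static inline long pkvm_hypercall4(unsigned long nr, unsigned long a0,
					   unsigned long a1, unsigned long a2,
					   unsigned long a3)
	{
		long ret;

		asm volatile("vmcall"
			     : "=a"(ret)
			     : "a"(nr), "b"(a0), "c"(a1), "d"(a2), "S"(a3)
			     : "memory");
		return ret;
	}

	/*
	 * The host VM never builds IOMMU page tables itself: it only describes
	 * the 1-stage DMA mapping it wants, and the hypervisor installs it after
	 * checking that the host actually owns the pages.  Nested use cases such
	 * as vSVA would be handled by extending this interface later, not by
	 * shadowing a guest-built table.
	 */
	static inline long pkvm_iommu_map(uint16_t bdf, uint64_t iova,
					  uint64_t pfn, uint64_t npages)
	{
		return pkvm_hypercall4(PKVM_HC_IOMMU_MAP, bdf, iova, pfn, npages);
	}

	static inline long pkvm_iommu_unmap(uint16_t bdf, uint64_t iova,
					    uint64_t npages)
	{
		return pkvm_hypercall4(PKVM_HC_IOMMU_UNMAP, bdf, iova, npages, 0);
	}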

> 
> >
> >>
> >> [*] You take the blue pill, the story ends, you wake up in your bed and believe
> >>     whatever you want to believe. You take the red pill, you stay in wonderland,
> >>     and I show you how deep the rabbit hole goes.
> >>
> >>     -Morpheus
> >



