> -----Original Message-----
> From: Dmytro Maluka <dmy@xxxxxxxxxxxx>
> Sent: Friday, June 9, 2023 4:35 PM
> To: Chen, Jason CJ <jason.cj.chen@xxxxxxxxx>; Christopherson,, Sean
> <seanjc@xxxxxxxxxx>
> Cc: kvm@xxxxxxxxxxxxxxx; android-kvm@xxxxxxxxxx; Dmitry Torokhov
> <dtor@xxxxxxxxxxxx>; Tomasz Nowicki <tn@xxxxxxxxxxxx>; Grzegorz Jaszczyk
> <jaz@xxxxxxxxxxxx>; Keir Fraser <keirf@xxxxxxxxxx>
> Subject: Re: [RFC PATCH part-5 00/22] VMX emulation
>
> On 6/9/23 04:07, Chen, Jason CJ wrote:
> >> -----Original Message-----
> >> From: Dmytro Maluka <dmy@xxxxxxxxxxxx>
> >> Sent: Friday, June 9, 2023 5:38 AM
> >> To: Chen, Jason CJ <jason.cj.chen@xxxxxxxxx>; Christopherson,, Sean
> >> <seanjc@xxxxxxxxxx>
> >> Cc: kvm@xxxxxxxxxxxxxxx; android-kvm@xxxxxxxxxx; Dmitry Torokhov
> >> <dtor@xxxxxxxxxxxx>; Tomasz Nowicki <tn@xxxxxxxxxxxx>; Grzegorz
> >> Jaszczyk <jaz@xxxxxxxxxxxx>; Keir Fraser <keirf@xxxxxxxxxx>
> >> Subject: Re: [RFC PATCH part-5 00/22] VMX emulation
> >>
> >> On 3/14/23 17:29, Jason Chen CJ wrote:
> >>> On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
> >>>> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
> >>>>> This patch set is part 5 of these RFC patches. It introduces VMX
> >>>>> emulation for pKVM on the Intel platform.
> >>>>>
> >>>>> If the host VM wants the capability to run its own guests, it
> >>>>> needs VMX support.
> >>>>
> >>>> No, the host VM only needs a way to request pKVM to run a VM. If
> >>>> we go down the rabbit hole of pKVM on x86, I think we should take
> >>>> the red pill[*] and go all the way down said rabbit hole by heavily
> >>>> paravirtualizing the KVM=>pKVM interface.
> >>>
> >>> Hi Sean,
> >>>
> >>> As I mentioned in my reply to "[RFC PATCH part-1 0/5] pKVM on
> >>> Intel Platform Introduction", we hope VMX emulation can be there at
> >>> least for normal VM support.
> >>>
> >>>>
> >>>> Except for VMCALL vs. VMMCALL, it should be possible to eliminate
> >>>> all traces of VMX and SVM from the interface. That means no VMCS
> >>>> emulation, no EPT shadowing, etc. As a bonus, any paravirt stuff
> >>>> we do for pKVM x86 would also be usable for KVM-on-KVM nested
> >>>> virtualization.
> >>>>
> >>>> E.g. an idea floating around my head is to add a paravirt paging
> >>>> interface for KVM-on-KVM so that L1 (KVM-high in this RFC) doesn't
> >>>> need to maintain its own TDP page tables. I haven't pursued that
> >>>> idea in any real capacity since most nested virtualization use
> >>>> cases for KVM involve running an older L1 kernel and/or a non-KVM
> >>>> L1 hypervisor, i.e. there's no concrete use case to justify the
> >>>> development and maintenance cost. But if the PV code is "needed"
> >>>> by pKVM anyways...
> >>>
> >>> Yes, I agree, we could get performance and memory cost benefits by
> >>> using paravirt stuff for KVM-on-KVM nested virtualization. Am I
> >>> missing any other benefits you see?
> >>
> >> As I see it, the advantages of a PV design for pKVM are:
> >>
> >> - performance
> >> - memory cost
> >> - code simplicity (of the pKVM hypervisor, first of all)
> >> - better alignment with pKVM on ARM
> >>
> >> Regarding performance, I actually suspect it may even be the least
> >> significant of the above. I guess with a PV design we'd have roughly
> >> as many extra vmexits as we have now (just due to hypercalls instead
> >> of traps on emulated VMX instructions etc), so perhaps the
> >> performance improvement would not be as big as we might expect (am I
> >> wrong?).
> >
> > I think with a PV design we can benefit from skipping shadowing. For
> > example, a TLB flush could be done in the hypervisor directly, while
> > with shadow EPT we need to emulate it by destroying shadow EPT page
> > table entries and then re-shadowing on the next EPT violation.
>
> Yeah indeed, good point.
>
> Is my understanding correct: a TLB flush is still gonna be requested by
> the host VM via a hypercall, but the benefit is that the hypervisor
> merely needs to do INVEPT?

Sorry for the late response. In my P.O.V. we should let the EPT be
totally owned by the hypervisor, so the host VM will not trigger TLB
flushes at all, as it does not manage the EPT directly.
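To make that a bit more concrete, below is a rough sketch of how the
host-facing part of such a PV paging interface could look. To be clear,
all names, hypercall numbers and the register ABI here are invented
purely for illustration; they are not from this RFC or from KVM:

/*
 * Illustrative sketch of a PV second-level paging interface for pKVM
 * on x86.  Hypercall numbers, names and the vmcall ABI are invented
 * for this example only.
 */
#include <stdint.h>

enum pkvm_pv_call {
	PKVM_HC_MAP_GPA   = 0x1000,	/* map guest-physical -> host-physical */
	PKVM_HC_UNMAP_GPA = 0x1001,	/* unmap a guest-physical range */
	PKVM_HC_SET_PROT  = 0x1002,	/* change access permissions */
};

/* Minimal vmcall wrapper (Intel; AMD would use vmmcall instead). */
static inline long pkvm_hypercall3(unsigned long nr, unsigned long a0,
				   unsigned long a1, unsigned long a2)
{
	long ret;

	asm volatile("vmcall"
		     : "=a"(ret)
		     : "a"(nr), "b"(a0), "c"(a1), "d"(a2)
		     : "memory");
	return ret;
}

/*
 * The host VM only describes what it wants mapped for a guest.  pKVM
 * owns the EPT, installs or removes the entries itself, and performs
 * any INVEPT internally, so there is no separate TLB-flush call here.
 */
static inline long pkvm_map_gpa(uint64_t gpa, uint64_t hpa, uint64_t size)
{
	return pkvm_hypercall3(PKVM_HC_MAP_GPA, gpa, hpa, size);
}

static inline long pkvm_unmap_gpa(uint64_t gpa, uint64_t size)
{
	return pkvm_hypercall3(PKVM_HC_UNMAP_GPA, gpa, size, 0);
}

Whether an explicit flush or batched-unmap call is still worth adding
for performance is a separate design choice; the point is just that
second-level TLB maintenance becomes pKVM's internal business.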
> >
> > Based on PV, with well-designed interfaces, I suppose we can also
> > make a general design for nested support on KVM-on-hypervisor (e.g.,
> > we can do it first for KVM-on-KVM and then extend it to support
> > KVM-on-pKVM and others).
>
> Yep, as Sean suggested. Forgot to mention this too.
>
> >
> >>
> >> But the memory cost advantage seems to be very attractive. With the
> >> emulated design pKVM needs to maintain shadow page tables (and other
> >> shadow structures too, but page tables are the most memory
> >> demanding). Moreover, the number of shadow page tables is obviously
> >> proportional to the number of VMs running, and since pKVM reserves
> >> all its memory upfront preparing for the worst case, we have pretty
> >> restrictive limits on the maximum number of VMs [*] (and if we run
> >> fewer VMs than this limit, we waste memory).
> >>
> >> To give some numbers, on a machine with 8GB of RAM, on ChromeOS with
> >> this pKVM-on-x86 PoC we currently have a pKVM memory cost of 229MB
> >> (and it only allows up to 10 VMs running simultaneously), while on
> >> Android (ARM) it is afaik only 44MB. According to my analysis, if we
> >> get rid of all the shadow tables in pKVM, we should have 44MB on x86
> >> too (regardless of the maximum number of VMs).
> >>
> >> [*] And some other limits too, e.g. on the maximum number of
> >> DMA-capable devices, since pKVM also needs shadow IOMMU page tables
> >> if we have only a 1-stage IOMMU.
> >
> > I may not have captured your meaning. Do you mean a device wants
> > 2-stage translation while we only have a 1-stage IOMMU? If so, I'm
> > not sure there is a real use case.
> >
> > Per my understanding, for a PV IOMMU the simplest implementation is
> > to just maintain the 1-stage DMA mappings in the hypervisor, as the
> > guest most likely just wants 1-stage DMA mappings for its devices.
> > Then, for an IOMMU with nested capability where the guest wants to
> > use that nested capability (e.g., for vSVA), we can further extend
> > the PV IOMMU interfaces.
>
> Sorry, I wasn't clear enough. I mean, on the host or guest side we need
> just a 1-stage IOMMU, but pKVM needs to ensure memory protection. So if
> 2-stage is available, pKVM can just use it, but if not, currently in
> pKVM on Intel we use shadow page tables for that (just as a consequence
> of the overall "mostly emulated" design). (So as a result, in
> particular, the pKVM memory footprint depends on the max number of PCI
> devices allowed by pKVM.) And yeah, with a PV IOMMU we can avoid the
> need for shadow page tables while still having only a 1-stage IOMMU,
> that's exactly my point.
>
> >
> >>
> >>>
> >>>>
> >>>> [*] You take the blue pill, the story ends, you wake up in your bed
> >>>> and believe whatever you want to believe. You take the red pill,
> >>>> you stay in wonderland, and I show you how deep the rabbit hole
> >>>> goes.
> >>>>
> >>>> -Morpheus
> >>>
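Coming back to the PV IOMMU discussion above: for illustration only, I
imagine the host-facing interface could be as small as a map/unmap
pair, with pKVM maintaining the single-stage I/O page tables itself.
Again, every name, number and field below is invented for this sketch:

/*
 * Illustrative PV IOMMU interface sketch.  pKVM would validate each
 * request against its page-ownership tracking, install the mapping in
 * its own single-stage I/O page tables and flush the IOTLB internally,
 * so no shadow IOMMU page tables are needed in the hypervisor.
 */
#include <stdint.h>

enum pkvm_pv_iommu_call {
	PKVM_HC_IOMMU_MAP   = 0x2000,
	PKVM_HC_IOMMU_UNMAP = 0x2001,
};

struct pkvm_iommu_map_req {
	uint32_t bdf;	/* PCI bus/device/function of the endpoint */
	uint32_t prot;	/* requested read/write permissions */
	uint64_t iova;	/* I/O virtual address seen by the device */
	uint64_t gpa;	/* guest-physical address backing it */
	uint64_t size;	/* page-aligned length of the mapping */
};

/* The caller passes the guest-physical address of a request struct. */
long pkvm_iommu_map(uint64_t req_gpa);   /* PKVM_HC_IOMMU_MAP */
long pkvm_iommu_unmap(uint64_t req_gpa); /* PKVM_HC_IOMMU_UNMAP */

An extension for nested capability / vSVA, as mentioned above, could
then be added later as separate calls without changing the basic
map/unmap semantics.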