On 6/9/23 04:07, Chen, Jason CJ wrote:
>> -----Original Message-----
>> From: Dmytro Maluka <dmy@xxxxxxxxxxxx>
>> Sent: Friday, June 9, 2023 5:38 AM
>> To: Chen, Jason CJ <jason.cj.chen@xxxxxxxxx>; Christopherson,, Sean <seanjc@xxxxxxxxxx>
>> Cc: kvm@xxxxxxxxxxxxxxx; android-kvm@xxxxxxxxxx; Dmitry Torokhov <dtor@xxxxxxxxxxxx>; Tomasz Nowicki <tn@xxxxxxxxxxxx>; Grzegorz Jaszczyk <jaz@xxxxxxxxxxxx>; Keir Fraser <keirf@xxxxxxxxxx>
>> Subject: Re: [RFC PATCH part-5 00/22] VMX emulation
>>
>> On 3/14/23 17:29, Jason Chen CJ wrote:
>>> On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
>>>> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
>>>>> This patch set is part-5 of this RFC series. It introduces VMX
>>>>> emulation for pKVM on the Intel platform.
>>>>>
>>>>> The host VM wants the capability to run its own guests, so it needs
>>>>> VMX support.
>>>>
>>>> No, the host VM only needs a way to request pKVM to run a VM. If we
>>>> go down the rabbit hole of pKVM on x86, I think we should take the
>>>> red pill[*] and go all the way down said rabbit hole by heavily
>>>> paravirtualizing the KVM=>pKVM interface.
>>>
>>> hi, Sean,
>>>
>>> Like I mentioned in the reply to "[RFC PATCH part-1 0/5] pKVM on
>>> Intel Platform Introduction", we hope VMX emulation can be there at
>>> least for normal VM support.
>>>
>>>>
>>>> Except for VMCALL vs. VMMCALL, it should be possible to eliminate
>>>> all traces of VMX and SVM from the interface. That means no VMCS
>>>> emulation, no EPT shadowing, etc. As a bonus, any paravirt stuff we
>>>> do for pKVM x86 would also be usable for KVM-on-KVM nested
>>>> virtualization.
>>>>
>>>> E.g. an idea floating around my head is to add a paravirt paging
>>>> interface for KVM-on-KVM so that L1 (KVM-high in this RFC) doesn't
>>>> need to maintain its own TDP page tables. I haven't pursued that
>>>> idea in any real capacity since most nested virtualization use cases
>>>> for KVM involve running an older L1 kernel and/or a non-KVM L1
>>>> hypervisor, i.e. there's no concrete use case to justify the
>>>> development and maintenance cost. But if the PV code is "needed" by
>>>> pKVM anyways...
>>>
>>> Yes, I agree, we could get performance & memory cost benefits by
>>> using paravirt stuff for KVM-on-KVM nested virtualization. May I know
>>> whether I missed any other benefits you saw?
>>
>> As I see it, the advantages of a PV design for pKVM are:
>>
>> - performance
>> - memory cost
>> - code simplicity (of the pKVM hypervisor, first of all)
>> - better alignment with pKVM on ARM
>>
>> Regarding performance, I actually suspect it may even be the least
>> significant of the above. I guess with a PV design we'd have roughly
>> as many extra vmexits as we have now (just due to hypercalls instead
>> of traps on emulated VMX instructions etc), so perhaps the performance
>> improvement would not be as big as we might expect (am I wrong?).
>
> I think with a PV design we can benefit from skipping shadowing. For
> example, a TLB flush could be done in the hypervisor directly, while
> with shadow EPT we need to emulate it by destroying shadow EPT page
> table entries and then re-shadowing upon EPT violation.

Yeah indeed, good point. Is my understanding correct: a TLB flush is
still going to be requested by the host VM via a hypercall, but the
benefit is that the hypervisor merely needs to do INVEPT?
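
To make it concrete, here is a rough sketch of how I imagine the two
cases on the hypervisor side (all names below are made up for
illustration, this is not actual pKVM or KVM code):

#include <linux/types.h>

/* Hypothetical per-VM state kept by the hypervisor. */
struct pkvm_shadow_vm {
        u64 eptp;        /* EPT maintained by the hypervisor (PV case) */
        u64 shadow_eptp; /* shadow EPT built from the host VM's tables */
};

/* Assumed helpers, not actual pKVM functions. */
void pkvm_invept_single_context(u64 eptp);
void pkvm_zap_shadow_ept(struct pkvm_shadow_vm *vm);

/*
 * PV design: the hypervisor owns the only stage-2 page table, so the
 * host VM's "flush TLB" hypercall boils down to a single INVEPT.
 */
static void handle_hc_flush_remote_tlbs(struct pkvm_shadow_vm *vm)
{
        pkvm_invept_single_context(vm->eptp);
}

/*
 * Emulated design: a trapped INVEPT cannot simply be passed through,
 * because the shadow EPT may still hold stale translations, so we
 * destroy the shadow entries, flush, and rebuild them lazily on
 * subsequent EPT violations.
 */
static void handle_emulated_invept(struct pkvm_shadow_vm *vm)
{
        pkvm_zap_shadow_ept(vm);
        pkvm_invept_single_context(vm->shadow_eptp);
}

If that matches what you have in mind, then indeed the win is not fewer
vmexits but much less work (and no shadow state churn) per exit.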
>
> Based on PV, with well-designed interfaces, I suppose we can also make
> a general design for nested support on KVM-on-hypervisor (e.g., we can
> do it first for KVM-on-KVM and then extend it to support KVM-on-pKVM
> and others).

Yep, as Sean suggested. Forgot to mention this too.

>
>> But the memory cost advantage seems to be very attractive. With the
>> emulated design pKVM needs to maintain shadow page tables (and other
>> shadow structures too, but page tables are the most memory demanding).
>> Moreover, the number of shadow page tables is obviously proportional
>> to the number of VMs running, and since pKVM reserves all its memory
>> upfront preparing for the worst case, we have pretty restrictive
>> limits on the maximum number of VMs [*] (and if we run fewer VMs than
>> this limit, we waste memory).
>>
>> To give some numbers, on a machine with 8GB of RAM, on ChromeOS with
>> this pKVM-on-x86 PoC we currently have a pKVM memory cost of 229MB
>> (and it only allows up to 10 VMs running simultaneously), while on
>> Android (ARM) it is afaik only 44MB. According to my analysis, if we
>> get rid of all the shadow tables in pKVM, we should have 44MB on x86
>> too (regardless of the maximum number of VMs).
>>
>> [*] And some other limits too, e.g. on the maximum number of
>> DMA-capable devices, since pKVM also needs shadow IOMMU page tables
>> if we have only a 1-stage IOMMU.
>
> I may not capture your meaning. Do you mean the device wants 2-stage
> while we only have a 1-stage IOMMU? If so, I'm not sure there is a
> real use case.
>
> Per my understanding, for a PV IOMMU the simplest implementation is to
> just maintain the 1-stage DMA mapping in the hypervisor, as the guest
> most likely just wants a 1-stage DMA mapping for its device. So if the
> IOMMU has nested capability and meanwhile the guest wants to use that
> nested capability (e.g., for vSVA), we can further extend the PV IOMMU
> interfaces.

Sorry, I wasn't clear enough. I mean, on the host or guest side we need
just a 1-stage IOMMU, but pKVM needs to ensure memory protection. So if
2-stage is available, pKVM can just use it, but if not, currently in
pKVM on Intel we use shadow IOMMU page tables for that (just as a
consequence of the overall "mostly emulated" design). (So as a result,
in particular, the pKVM memory footprint depends on the max number of
PCI devices allowed by pKVM.)

And yeah, with a PV IOMMU we can avoid the need for shadow page tables
while still having only a 1-stage IOMMU, that's exactly my point (see
the rough sketch of such an interface at the end of this mail).

>
>>
>>>
>>>>
>>>> [*] You take the blue pill, the story ends, you wake up in your bed
>>>> and believe whatever you want to believe. You take the red pill,
>>>> you stay in wonderland, and I show you how deep the rabbit hole
>>>> goes.
>>>>
>>>> -Morpheus
>>>
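
P.S. Regarding the PV IOMMU mentioned above, just to illustrate the
kind of interface I have in mind. This is a rough sketch only: the
hypercall numbers, names and argument layout are all made up, and
details like protection flags are omitted for brevity.

#include <linux/kvm_para.h>     /* kvm_hypercall3(), kvm_hypercall4() */
#include <linux/types.h>

/* Hypothetical hypercall numbers, not an actual ABI proposal. */
#define PKVM_HC_IOMMU_MAP       0x1000
#define PKVM_HC_IOMMU_UNMAP     0x1001

/*
 * Host-side wrappers: instead of writing IOMMU page tables itself, the
 * host VM asks the hypervisor to establish IOVA->PA mappings for a
 * device. The hypervisor checks that the pages are owned by (or shared
 * with) the caller and writes the single-stage IOMMU page table it
 * maintains for that device, so no shadow page tables are needed.
 */
static inline long pkvm_iommu_map(u32 device_id, u64 iova, u64 pa, u64 size)
{
        return kvm_hypercall4(PKVM_HC_IOMMU_MAP, device_id, iova, pa, size);
}

static inline long pkvm_iommu_unmap(u32 device_id, u64 iova, u64 size)
{
        return kvm_hypercall3(PKVM_HC_IOMMU_UNMAP, device_id, iova, size);
}

Something along these lines should work the same whether the hardware
has a 1-stage or 2-stage IOMMU, which is exactly why the pKVM memory
footprint would stop depending on the number of DMA-capable devices.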