Re: [RFC PATCH part-5 00/22] VMX emulation

On 6/9/23 04:07, Chen, Jason CJ wrote:
>> -----Original Message-----
>> From: Dmytro Maluka <dmy@xxxxxxxxxxxx>
>> Sent: Friday, June 9, 2023 5:38 AM
>> To: Chen, Jason CJ <jason.cj.chen@xxxxxxxxx>; Christopherson,, Sean
>> <seanjc@xxxxxxxxxx>
>> Cc: kvm@xxxxxxxxxxxxxxx; android-kvm@xxxxxxxxxx; Dmitry Torokhov
>> <dtor@xxxxxxxxxxxx>; Tomasz Nowicki <tn@xxxxxxxxxxxx>; Grzegorz Jaszczyk
>> <jaz@xxxxxxxxxxxx>; Keir Fraser <keirf@xxxxxxxxxx>
>> Subject: Re: [RFC PATCH part-5 00/22] VMX emulation
>>
>> On 3/14/23 17:29, Jason Chen CJ wrote:
>>> On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
>>>> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
>>>>> This patch set is part-5 of this RFC patches. It introduces VMX
>>>>> emulation for pKVM on Intel platform.
>>>>>
>>>>> The host VM wants the capability to run its own guests, so it needs VMX support.
>>>>
>>>> No, the host VM only needs a way to request pKVM to run a VM.  If we
>>>> go down the rabbit hole of pKVM on x86, I think we should take the
>>>> red pill[*] and go all the way down said rabbit hole by heavily paravirtualizing
>>>> the KVM=>pKVM interface.
>>>
>>> hi, Sean,
>>>
>>> Like I mentioned in the reply to "[RFC PATCH part-1 0/5] pKVM on
>>> Intel Platform Introduction", we hope VMX emulation can be kept at
>>> least for normal VM support.
>>>
>>>>
>>>> Except for VMCALL vs. VMMCALL, it should be possible to eliminate all
>>>> traces of VMX and SVM from the interface.  That means no VMCS
>>>> emulation, no EPT shadowing, etc.  As a bonus, any paravirt stuff we
>>>> do for pKVM x86 would also be usable for KVM-on-KVM nested virtualization.
>>>>
>>>> E.g. an idea floating around my head is to add a paravirt paging
>>>> interface for KVM-on-KVM so that L1 (KVM-high in this RFC) doesn't
>>>> need to maintain its own TDP page tables.  I haven't pursued that
>>>> idea in any real capacity since most nested virtualization use cases
>>>> for KVM involve running an older L1 kernel and/or a non-KVM L1
>>>> hypervisor, i.e. there's no concrete use case to justify the development and
>>>> maintenance cost.  But if the PV code is "needed" by pKVM anyways...
>>>
>>> Yes, I agree, we could get performance & memory cost benefits by using
>>> paravirt stuff for KVM-on-KVM nested virtualization. May I know whether I
>>> missed any other benefits you saw?
>>
>> As I see it, the advantages of a PV design for pKVM are:
>>
>> - performance
>> - memory cost
>> - code simplicity (of the pKVM hypervisor, first of all)
>> - better alignment with the pKVM on ARM
>>
>> Regarding performance, I actually suspect it may even be the least significant of
>> the above. I guess with a PV design we'd have roughly as many extra vmexits as
>> we have now (just due to hypercalls instead of traps on emulated VMX
>> instructions, etc.), so perhaps the performance improvement would not be as big
>> as we might expect (am I wrong?).
> 
> I think with a PV design, we can benefit from skipping shadowing. For example, a TLB flush
> could be done in the hypervisor directly, while shadow EPT needs to emulate it by destroying
> shadow EPT page table entries and then re-shadowing upon the next EPT violation.

Yeah indeed, good point.

Is my understanding correct: TLB flush is still gonna be requested by
the host VM via a hypercall, but the benefit is that the hypervisor
merely needs to do INVEPT?
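
To illustrate the difference as I understand it, a very rough sketch (all
the names below, e.g. pkvm_vm, handle_hc_tlb_flush, zap_shadow_ept, are made
up for illustration, this is not actual pKVM code):

#include <stdint.h>

struct pkvm_vm {
        uint64_t eptp;                  /* EPT pointer owned by the hypervisor */
};

/* Stubbed wrapper around single-context INVEPT. */
static void invept_single_context(uint64_t eptp)
{
        (void)eptp;                     /* would execute INVEPT type 1 here */
}

/*
 * PV model: the host VM issues one hypercall and the hypervisor, which
 * owns the guest's EPT, only has to invalidate the TLB.
 */
static long handle_hc_tlb_flush(struct pkvm_vm *vm)
{
        invept_single_context(vm->eptp);
        return 0;
}

/* Placeholder for tearing down shadow EPT entries. */
static void zap_shadow_ept(struct pkvm_vm *vm)
{
        (void)vm;
}

/*
 * Shadow-EPT model: the hypervisor cannot trust host-built tables, so a
 * guest TLB flush is emulated by destroying the shadow entries; they are
 * rebuilt lazily on later EPT violations, i.e. extra exits and table walks.
 */
static long emulate_guest_tlb_flush(struct pkvm_vm *vm)
{
        zap_shadow_ept(vm);
        invept_single_context(vm->eptp);
        return 0;
}

int main(void)
{
        struct pkvm_vm vm = { .eptp = 0 };

        handle_hc_tlb_flush(&vm);
        emulate_guest_tlb_flush(&vm);
        return 0;
}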

> 
> Based on PV, with well-designed interfaces, I suppose we can also come up with a general
> design for nested support for KVM-on-hypervisor (e.g., we can do it first for KVM-on-KVM
> and then extend it to support KVM-on-pKVM and others).

Yep, as Sean suggested. I forgot to mention this too.
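
Just to sketch what such a PV interface could look like (the hypercall
numbers and names below are invented purely for illustration, only VMCALL
itself is real):

#include <stdint.h>

/* Hypothetical hypercall numbers for a fully PV host<->pKVM interface. */
enum pkvm_hc {
        HC_VM_CREATE    = 0x1000,       /* create a VM, no VMCS emulation     */
        HC_VCPU_RUN     = 0x1001,       /* run a vCPU                         */
        HC_MAP_GPA      = 0x1002,       /* install a gpa->hpa mapping in the  */
        HC_UNMAP_GPA    = 0x1003,       /* TDP tables that pKVM itself owns   */
        HC_TLB_FLUSH    = 0x1004,
};

/* Host-side VMCALL wrapper, in the style of kvm_hypercall3(). */
static inline long pkvm_hypercall3(uint64_t nr, uint64_t a0,
                                   uint64_t a1, uint64_t a2)
{
        long ret;

        asm volatile("vmcall"
                     : "=a"(ret)
                     : "a"(nr), "b"(a0), "c"(a1), "d"(a2)
                     : "memory");
        return ret;
}

/*
 * With such an interface the host's KVM never builds (and pKVM never
 * shadows) TDP tables for a protected guest; the host only tells pKVM
 * which gpa should be backed by which donated page.
 */
static inline long pkvm_map_gpa(uint64_t vm_handle, uint64_t gpa, uint64_t hpa)
{
        return pkvm_hypercall3(HC_MAP_GPA, vm_handle, gpa, hpa);
}

The point being that the host never programs VMCS fields or EPT entries
directly; it only issues hypercalls like the pkvm_map_gpa() above for pages
it has donated.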

> 
>>
>> But the memory cost advantage seems to be very attractive. With the emulated
>> design pKVM needs to maintain shadow page tables (and other shadow
>> structures too, but page tables are the most memory demanding). Moreover,
>> the number of shadow page tables is obviously proportional to the number of
>> VMs running, and since pKVM reserves all its memory upfront preparing for the
>> worst case, we have pretty restrictive limits on the maximum number of VMs [*]
>> (and if we run fewer VMs than this limit, we waste memory).
>>
>> To give some numbers, on a machine with 8GB of RAM, on ChromeOS with this
>> pKVM-on-x86 PoC currently we have pKVM memory cost of 229MB (and it only
>> allows up to 10 VMs running simultaneously), while on Android (ARM) it is afaik
>> only 44MB. According to my analysis, if we get rid of all the shadow tables in
>> pKVM, we should have 44MB on x86 too (regardless of the maximum number of
>> VMs).
>>
>> [*] And some other limits too, e.g. on the maximum number of DMA-capable
>> devices, since pKVM also needs shadow IOMMU page tables if we have only a
>> 1-stage IOMMU.
> 
> I may not have captured your meaning. Do you mean the device wants 2-stage while we only
> have a 1-stage IOMMU? If so, I'm not sure there is a real use case.
> 
> Per my understanding, for a PV IOMMU the simplest implementation is to just
> maintain the 1-stage DMA mapping in the hypervisor, as the guest most likely just wants
> a 1-stage DMA mapping for its device. Then, for an IOMMU with nested capability where the
> guest wants to use that nested capability (e.g., for vSVA), we can further extend the PV
> IOMMU interfaces.

Sorry, I wasn't clear enough. I mean, on the host or guest side we need
just a 1-stage IOMMU, but pKVM needs to ensure memory protection. So if
a 2-stage IOMMU is available, pKVM can just use it, but if not, currently
in pKVM on Intel we use shadow page tables for that (just as a consequence
of the overall "mostly emulated" design). (As a result, in particular,
the pKVM memory footprint depends on the max number of PCI devices
allowed by pKVM.) And yeah, with a PV IOMMU we can avoid the need for
shadow page tables while still having only a 1-stage IOMMU, which is
exactly my point.
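
FWIW, the kind of PV IOMMU handler I have in mind would look roughly like
this (a sketch with invented names such as HC_IOMMU_MAP and host_owns_page,
not actual code):

#include <stdint.h>
#include <stdbool.h>
#include <errno.h>

/* The single-stage IOMMU page table is owned by pKVM, not by the host. */
struct pkvm_iommu_domain {
        uint64_t pgd;
};

/* Would check the page ownership tracked by pKVM (donated vs. host-owned). */
static bool host_owns_page(uint64_t hpa)
{
        (void)hpa;
        return true;
}

/* Would write the IOVA->HPA PTEs into pKVM's own 1-stage table. */
static int iommu_pgtable_map(struct pkvm_iommu_domain *dom, uint64_t iova,
                             uint64_t hpa, uint64_t size, uint32_t prot)
{
        (void)dom; (void)iova; (void)hpa; (void)size; (void)prot;
        return 0;
}

/*
 * Handler for a hypothetical HC_IOMMU_MAP hypercall: the host never touches
 * IOMMU page tables itself, so there is nothing to shadow, and protection is
 * enforced by the ownership check before the mapping is installed.
 */
long handle_hc_iommu_map(struct pkvm_iommu_domain *dom, uint64_t iova,
                         uint64_t hpa, uint64_t size, uint32_t prot)
{
        if (!host_owns_page(hpa))
                return -EPERM;

        return iommu_pgtable_map(dom, iova, hpa, size, prot);
}

So the host only issues map/unmap hypercalls, pKVM has nothing to shadow,
and protection is still enforced via the ownership check, all with just a
1-stage IOMMU.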

> 
>>
>>>
>>>>
>>>> [*] You take the blue pill, the story ends, you wake up in your bed and believe
>>>>     whatever you want to believe. You take the red pill, you stay in wonderland,
>>>>     and I show you how deep the rabbit hole goes.
>>>>
>>>>     -Morpheus
>>>


