Re: [RFC PATCH part-5 00/22] VMX emulation

On 3/14/23 17:29, Jason Chen CJ wrote:
> On Mon, Mar 13, 2023 at 09:58:27AM -0700, Sean Christopherson wrote:
>> On Mon, Mar 13, 2023, Jason Chen CJ wrote:
>>> This patch set is part 5 of this RFC series. It introduces VMX
>>> emulation for pKVM on Intel platforms.
>>>
>>> If the host VM wants the capability to run its own guests, it needs
>>> VMX support.
>>
>> No, the host VM only needs a way to request pKVM to run a VM.  If we go down the
>> rabbit hole of pKVM on x86, I think we should take the red pill[*] and go all the
>> way down said rabbit hole by heavily paravirtualizing the KVM=>pKVM interface.
> 
> hi, Sean,
> 
> As I mentioned in my reply to "[RFC PATCH part-1 0/5] pKVM on Intel
> Platform Introduction", we hope VMX emulation can be there at least for
> normal VM support.
> 
>>
>> Except for VMCALL vs. VMMCALL, it should be possible to eliminate all traces of
>> VMX and SVM from the interface.  That means no VMCS emulation, no EPT shadowing,
>> etc.  As a bonus, any paravirt stuff we do for pKVM x86 would also be usable for
>> KVM-on-KVM nested virtualization.
>>
>> E.g. an idea floating around my head is to add a paravirt paging interface for
>> KVM-on-KVM so that L1 (KVM-high in this RFC) doesn't need to maintain its own
>> TDP page tables.  I haven't pursued that idea in any real capacity since most
>> nested virtualization use cases for KVM involve running an older L1 kernel and/or
>> a non-KVM L1 hypervisor, i.e. there's no concrete use case to justify the development
>> and maintenance cost.  But if the PV code is "needed" by pKVM anyways...
> 
> Yes, I agree, we could get performance & memory cost benefits by using
> paravirt stuff for KVM-on-KVM nested virtualization. May I know whether I
> missed any other benefits you saw?

As I see it, the advantages of a PV design for pKVM are:

- performance
- memory cost
- code simplicity (of the pKVM hypervisor, first of all)
- better alignment with pKVM on ARM
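
For concreteness, the PV interface Sean sketches might look something like
the following (purely illustrative; none of these names or operations are an
existing ABI, I'm making them up to frame the discussion):

    /*
     * Hypothetical KVM-high => pKVM hypercall surface; no such ABI
     * exists today, this is only to make the idea concrete.
     */
    enum pkvm_hcall {
            PKVM_HC_VM_CREATE,      /* donate memory, get back a VM handle */
            PKVM_HC_VCPU_CREATE,    /* register a vCPU with a VM           */
            PKVM_HC_MAP_GPA,        /* PV paging: ask pKVM to map gpa->pfn,
                                       so KVM-high keeps no TDP tables     */
            PKVM_HC_VCPU_RUN,       /* run a vCPU until the next exit      */
            PKVM_HC_VM_DESTROY,     /* tear down, reclaim donated memory   */
    };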

Regarding performance, I actually suspect it may even be the least significant
of the above. I'd guess that with a PV design we'd have roughly as many extra
vmexits as we have now (just due to hypercalls instead of traps on emulated
VMX instructions, etc.), so the performance improvement might not be as big as
we'd expect (am I wrong?).
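
To illustrate why: in the emulated design, each VMX instruction executed by
KVM-high traps to pKVM; in a PV design the same operation becomes a single
VMCALL, i.e. still one vmexit. A minimal sketch of the KVM-high side, reusing
the hypothetical PKVM_HC_VCPU_RUN from above (modeled on the kernel's
kvm_hypercall helpers):

    /* One VMCALL == one vmexit to pKVM, the same order of magnitude
     * as one trapped VMRESUME in the emulated design. */
    static inline long pkvm_hypercall1(unsigned long nr, unsigned long a0)
    {
            long ret;

            asm volatile("vmcall"
                         : "=a" (ret)
                         : "a" (nr), "b" (a0)
                         : "memory");
            return ret;
    }

    static long pkvm_vcpu_run(unsigned long vcpu_handle)
    {
            return pkvm_hypercall1(PKVM_HC_VCPU_RUN, vcpu_handle);
    }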

But the memory cost advantage seems very attractive. With the emulated design,
pKVM needs to maintain shadow page tables (and other shadow structures too,
but page tables are the most memory-demanding). Moreover, the number of shadow
page tables is obviously proportional to the number of VMs running, and since
pKVM reserves all of its memory upfront to prepare for the worst case, we have
pretty restrictive limits on the maximum number of VMs [*] (and if we run
fewer VMs than this limit, we waste memory).
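
As a back-of-the-envelope estimate of that cost (my own assumptions, not
measured from the PoC: 4-level EPT with 4KiB mappings), a 4KiB page-table
page holds 512 entries, so one leaf table maps 2MiB:

    leaf tables for G GiB of guest memory = G GiB / 2 MiB  = 512 * G pages
    leaf-table memory                     = 512 * G * 4 KiB = 2 * G MiB

i.e. roughly 2MiB of leaf shadow tables per GiB of guest memory per VM, before
counting upper levels and the other shadow structures, and multiplied by the
maximum number of VMs when reserving for the worst case.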

To give some numbers: on a machine with 8GB of RAM running ChromeOS, this
pKVM-on-x86 PoC currently has a pKVM memory cost of 229MB (and it only allows
up to 10 VMs to run simultaneously), while on Android (ARM) it is, AFAIK, only
44MB. According to my analysis, if we get rid of all the shadow tables in
pKVM, we should get down to 44MB on x86 too (regardless of the maximum number
of VMs).

[*] And some other limits too, e.g. on the maximum number of DMA-capable
devices, since pKVM also needs shadow IOMMU page tables when only a 1-stage
IOMMU is available.

> 
>>
>> [*] You take the blue pill, the story ends, you wake up in your bed and believe
>>     whatever you want to believe. You take the red pill, you stay in wonderland,
>>     and I show you how deep the rabbit hole goes.
>>
>>     -Morpheus
> 


