Re: [PATCH 00/11] Shadow Paging performance improvements

----- junaids@xxxxxxxxxx wrote:

> The performance of shadow paging is severely degraded in some workloads
> when the guest kernel is using KPTI. This is primarily due to the vastly
> increased number of CR3 switches that result from KPTI.
> 
> This patch series implements various optimizations to reduce some of this
> overhead. Compared to the baseline, this results in a reduction from
> ~16m12s to ~4m44s for a 4-VCPU kernel compile benchmark and from ~25m5s
> to ~14m50s for a 1-VCPU kernel compile benchmark.
> 
> Junaid Shahid (11):
>   kvm: x86: Make sync_page() flush remote TLBs once only
>   kvm: x86: Refactor mmu_free_roots()
>   kvm: x86: Add fast CR3 switch code path
>   kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabled
>   kvm: x86: Add ability to skip TLB flush when switching CR3
>   kvm: x86: Map guest PCIDs to host PCIDs
>   kvm: vmx: Support INVPCID in shadow paging mode
>   kvm: x86: Skip TLB flush on fast CR3 switch when indicated by guest
>   kvm: x86: Add a root_hpa parameter to kvm_mmu->invlpg()
>   kvm: x86: Skip shadow page resync on CR3 switch when indicated by
>     guest
>   kvm: x86: Flush only affected TLB entries in kvm_mmu_invlpg*
> 

This patch series was very interesting to review.

One of our (Oracle) products has a proprietary binary-translation hypervisor which works with shadow paging as well.
A few months ago, when guests started using KPTI, I encountered the same performance hit and implemented
PCID & INVPCID support in this hypervisor's shadow MMU.

What is quite interesting is that I took a completely different approach to handle this:

I view SPTs (shadow page-tables) as the equivalent of virtual TLB entries.
Because PCID tags TLB entries, I decided to add a PCID tag to every SPT.
When an SPT is created, the currently active PCID is attached to it.
When SPTs are searched, the currently active PCID is also considered, so that one can have
multiple SPTs mirroring the same GPT (guest page-table) at the same paging mode and paging
level but with different PCIDs.
Then, handling bit 63 in CR3 is quite straightforward: just don't sync SPTs when it is set
on a CR3 switch.
Note that host PCID support is not a must; PCID can be emulated for the guest without it.
(Of course, host PCID support can still be utilized to improve physical TLB utilization,
the same as done in this series.)
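
A rough C sketch of the idea, just to illustrate (this is not code from KVM or from our
hypervisor; all struct and function names are invented):

/*
 * Illustrative only: treat each shadow page table (SPT) as a virtual TLB
 * entry by tagging it with the guest PCID that was active when it was
 * created, and by making that PCID part of the lookup key.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

struct spt_key {
    uint64_t gpt_gfn;      /* guest frame number of the mirrored GPT */
    uint8_t  level;        /* paging level of this shadow page */
    uint8_t  paging_mode;  /* guest paging mode */
    uint16_t pcid;         /* guest PCID active at creation time */
};

struct shadow_page {
    struct spt_key key;
    struct shadow_page *hash_next;
    /* ... sptes, unsync state, etc. ... */
};

static bool spt_key_match(const struct spt_key *a, const struct spt_key *b)
{
    /* Same GPT but a different PCID means a different SPT. */
    return a->gpt_gfn == b->gpt_gfn && a->level == b->level &&
           a->paging_mode == b->paging_mode && a->pcid == b->pcid;
}

/* Walk a hash bucket looking for an SPT matching GPT, level, mode and PCID. */
static struct shadow_page *spt_lookup(struct shadow_page *bucket,
                                      const struct spt_key *key)
{
    for (struct shadow_page *sp = bucket; sp; sp = sp->hash_next)
        if (spt_key_match(&sp->key, key))
            return sp;
    /* Caller allocates a new SPT tagged with the currently active PCID. */
    return NULL;
}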

At this point, with very few code modifications, PCID is completely supported, but not INVPCID.
To add support for INVPCID, I had some simple ideas:
1) Maintain root SPTs in data structures searchable by PCID, to easily invalidate them as needed.
2) Maintain a per-vCPU bitmap of all possible PCIDs (0x1000 bits) that marks "pending PCID flush"
(sketched below). When INVPCID is invoked for a PCID different than the active one, the relevant
bit in the bitmap is set. Then, on a CR3 switch, if the PCID is changed to one that has its bit
set in the bitmap, the hypervisor syncs SPTs even if bit 63 in CR3 was set. After the CR3 switch,
the bit is cleared from the bitmap.
This basically defers the flush of a PCID's entries to the next switch to that PCID.
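
A minimal sketch of idea (2), again with invented names and constants, with no claim of
matching the real implementation:

/*
 * Illustrative only: one "pending flush" bit per possible PCID (4096 bits)
 * per vCPU. INVPCID against a non-active PCID merely sets the bit; the SPT
 * sync is deferred to the next CR3 switch into that PCID.
 */
#include <stdint.h>
#include <stdbool.h>

#define NR_PCIDS        4096
#define CR3_PCID_MASK   0xfffULL
#define CR3_NOFLUSH     (1ULL << 63)
#define BITS_PER_WORD   (8 * sizeof(unsigned long))

struct vcpu_pcid_state {
    uint16_t active_pcid;
    unsigned long pending_flush[NR_PCIDS / BITS_PER_WORD];
};

static void set_pending(struct vcpu_pcid_state *v, uint16_t pcid)
{
    v->pending_flush[pcid / BITS_PER_WORD] |= 1UL << (pcid % BITS_PER_WORD);
}

static bool test_and_clear_pending(struct vcpu_pcid_state *v, uint16_t pcid)
{
    unsigned long *word = &v->pending_flush[pcid / BITS_PER_WORD];
    unsigned long bit = 1UL << (pcid % BITS_PER_WORD);
    bool was_set = (*word & bit) != 0;

    *word &= ~bit;
    return was_set;
}

/* INVPCID single-context emulation. */
static void emulate_invpcid_single(struct vcpu_pcid_state *v, uint16_t pcid)
{
    if (pcid == v->active_pcid) {
        /* Sync the SPTs tagged with this PCID right away (not shown). */
        return;
    }
    /* Defer the flush to the next switch into this PCID. */
    set_pending(v, pcid);
}

/* On CR3 switch: honor bit 63 unless a flush for the target PCID is pending. */
static bool cr3_switch_needs_spt_sync(struct vcpu_pcid_state *v, uint64_t new_cr3)
{
    uint16_t new_pcid = new_cr3 & CR3_PCID_MASK;
    bool noflush = (new_cr3 & CR3_NOFLUSH) != 0;
    bool pending = test_and_clear_pending(v, new_pcid);

    v->active_pcid = new_pcid;
    return !noflush || pending;
}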

Eventually I implemented (2), simply because it was easier to implement and gave good
enough performance for our needs.
It could be further optimized by keeping a per-PCID array of X VAs on which to perform a
virtual invlpg on the next switch to that PCID; if the array is full, fall back to treating
the CR3 switch as a full "TLB flush" even if bit 63 is set. A rough shape of that refinement
is sketched below.
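
A possible shape for that refinement (the array size and names are arbitrary, purely
illustrative):

/*
 * Illustrative only: instead of a single "flush everything for this PCID"
 * bit, queue up to a few VAs per PCID and replay them as virtual invlpgs
 * on the next switch into that PCID, falling back to a full sync when the
 * queue overflows.
 */
#include <stdint.h>
#include <stdbool.h>

#define MAX_PENDING_VAS 8   /* the "X" in the text; the value is arbitrary */

struct pcid_pending_invlpg {
    uint64_t vas[MAX_PENDING_VAS];
    uint8_t  count;
    bool     overflow;   /* too many VAs queued: next switch does a full sync */
};

static void queue_pending_invlpg(struct pcid_pending_invlpg *p, uint64_t va)
{
    if (p->count < MAX_PENDING_VAS)
        p->vas[p->count++] = va;
    else
        p->overflow = true;
}

/* Returns true if the caller must still do a full SPT sync for this PCID. */
static bool replay_pending_invlpg(struct pcid_pending_invlpg *p,
                                  void (*virtual_invlpg)(uint64_t va))
{
    bool full_sync = p->overflow;

    if (!full_sync)
        for (uint8_t i = 0; i < p->count; i++)
            virtual_invlpg(p->vas[i]);

    p->count = 0;
    p->overflow = false;
    return full_sync;
}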

The above approach worked pretty well, and it made me wonder how the approach suggested by
this patch series compares to mine.
I think there is an interesting trade-off between these two approaches:
My approach is simpler and supports PCID & INVPCID generically, without limiting the
performance gain to the specific workload of KPTI's usage of PCID. For example, a guest
that uses PCIDs to minimize TLB flushes (Linux has such a mechanism) would also see
improved performance.
However, the main drawback of my approach is that GPTs mapping the user-space portion of a
process will have multiple SPTs mirroring them, one for the user PCID and one for the kernel
PCID. Therefore, this approach uses more SPT pages and could therefore result in more MMU flushes.
This is in contrast to this patch series, which can share the SPTs mirroring the GPTs that map
user space between the user PCID and the kernel PCID.

This turned out to be quite long, but I just wanted to give my two cents of analysis on this series. :)
Very nicely done.

Regards,
-Liran



