Re: [PATCH 2/2] KVM: x86: introduce cache configurations for previous CR3s

On Wed, Oct 30, 2024, zhuangel570 wrote:
> On Wed, Oct 30, 2024 at 4:38 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Tue, Oct 29, 2024, Yong He wrote:
> > The only potential downside to larger caches I can think of, is that keeping
> > root_count elevated would make it more difficult to reclaim shadow pages from
> > roots that are no longer relevant to the guest.  kvm_mmu_zap_oldest_mmu_pages()
> > in particular would refuse to reclaim roots.  That shouldn't be problematic for
> > legacy shadow paging, because KVM doesn't recursively zap shadow pages.  But for
> > nested TDP, mmu_page_zap_pte() frees the entire tree, in the common case that
> > child SPTEs aren't shared across multiple trees (common in legacy shadow paging,
> > extremely uncommon in nested TDP).
> >
> > And for the nested TDP issue, if it's actually a problem, I would *love* to
> > solve that problem by making KVM's forced reclaim more sophisticated.  E.g. one
> > idea would be to kick all vCPUs if the maximum number of pages has been reached,
> > have each vCPU purge old roots from prev_roots, and then reclaim unused roots.
> > It would be a bit more complicated than that, as KVM would need a way to ensure
> > forward progress, e.g. if the shadow pages limit has been reached with a single
> > root.  But even then, kvm_mmu_zap_oldest_mmu_pages() could be made a _lot_ smarter.
> 
> I'm not very familiar with TDP on TDP.
> I think you mean forcibly freeing cached roots in kvm_mmu_zap_oldest_mmu_pages()
> when no MMU pages can be zapped, e.g. kicking all vCPUs and purging their cached
> roots.

Not just when no MMU pages could be zapped; any time KVM needs to reclaim MMU
pages due to n_max_mmu_pages.
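
Just to illustrate the shape of that idea, a rough sketch follows; this is not
existing KVM code, and KVM_REQ_PURGE_PREV_ROOTS and prev_root_is_stale() are
made-up names:

/*
 * Sketch only: one possible "kick vCPUs, purge stale prev_roots, then
 * reclaim" flow.  KVM_REQ_PURGE_PREV_ROOTS and prev_root_is_stale() do
 * not exist in KVM today.
 */
static void kvm_mmu_reclaim_cached_roots(struct kvm *kvm)
{
        /* Each vCPU services the request on its next VM-Enter. */
        kvm_make_all_cpus_request(kvm, KVM_REQ_PURGE_PREV_ROOTS);
}

/* vCPU side, invoked from request processing in vcpu_enter_guest(). */
static void kvm_vcpu_purge_prev_roots(struct kvm_vcpu *vcpu)
{
        unsigned long roots_to_free = 0;
        int i;

        for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
                /* Hypothetical staleness test, e.g. a last-used generation. */
                if (prev_root_is_stale(vcpu, i))
                        roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
        }

        /* Drops root_count so forced reclaim can actually zap those roots. */
        kvm_mmu_free_roots(vcpu->kvm, vcpu->arch.mmu, roots_to_free);
}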

> > TL;DR: what if we simply bump the number of cached roots to ~16?
> 
> I set the number to 11 because the guest kernel uses 6 PCIDs (11 + current = 12).
> When there are more than 6 processes in the guest, PCIDs get reused and the
> cached roots are no longer easily hit.  The context switch benchmark shows no
> performance gain with 7 or 8 processes.

Do you control the guest kernel?  If so, it'd be interesting to see what happens
when you bump TLB_NR_DYN_ASIDS in the guest to something higher, and then adjust
KVM to match.
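
For reference, the two values in question as they stand upstream (the patch in
$SUBJECT makes the KVM side configurable):

/* Guest kernel, arch/x86/include/asm/tlbflush.h: dynamic ASIDs the guest cycles through. */
#define TLB_NR_DYN_ASIDS        6

/* Host KVM, arch/x86/include/asm/kvm_host.h: previous roots cached per MMU context. */
#define KVM_MMU_NUM_PREV_ROOTS  3

With PTI enabled in the guest, each ASID shows up as two distinct CR3 values
(user and kernel), which is presumably where the 11 + 1 current = 12 = 2 * 6
arithmetic above comes from.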

IIRC, Andy arrived at '6' in 10af6235e0d3 ("x86/mm: Implement PCID based optimization:
try to preserve old TLB entries using PCID") because that was the "sweet spot" for
hardware.  E.g. using fewer PCIDs wouldn't fully utilize hardware, and using more
PCIDs would oversubscribe the number of ASID tags too much.

For KVM shadow paging, the only meaningful limitation is the number of shadow
pages that KVM allows.  E.g. with a sufficiently high n_max_mmu_pages, the guest
could theoretically use hundreds of PCIDs with no ill effects.
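
FWIW, n_max_mmu_pages is normally sized automatically from the memslots, but
userspace can raise it per VM; a minimal sketch of that knob (error handling
omitted):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Raise the VM's shadow page ceiling (n_max_mmu_pages). */
static int set_mmu_page_limit(int vm_fd, unsigned long nr_mmu_pages)
{
        /* KVM_SET_NR_MMU_PAGES takes the count directly as the ioctl argument. */
        return ioctl(vm_fd, KVM_SET_NR_MMU_PAGES, nr_mmu_pages);
}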
