Re: [PATCH 2/2] KVM: x86: introduce cache configurations for previous CR3s

Sean Christopherson <seanjc@xxxxxxxxxx> · Tue, 29 Oct 2024 08:14:50 -0700

On Tue, Oct 29, 2024, Yong He wrote:
> From: Yong He <alexyonghe@xxxxxxxxxxx>
> 
> Introduce prev_roots_num param, so that we use more cache of
> previous CR3/root_hpa pairs, which help us to reduce shadow
> page table evict and rebuild overhead.
> 
> Signed-off-by: Yong He <alexyonghe@xxxxxxxxxxx>
> ---

...

> +uint __read_mostly prev_roots_num = KVM_MMU_NUM_PREV_ROOTS;
> +EXPORT_SYMBOL_GPL(prev_roots_num);
> +module_param_cb(prev_roots_num, &prev_roots_num_ops,
> +		&prev_roots_num, 0644);

Allowing the variable to be changed while KVM is running is unsafe.

I also think a module param is the wrong way to try to allow for bigger caches.
The caches themselves are relatively cheap, at 16 bytes per entry.  And I doubt
the cost of searching a larger cache in fast_pgd_switch() would have a measurable
impact, since the most recently used roots will be at the front of the cache,
i.e. only near-misses and misses will be affected.

The only potential downside to larger caches I can think of, is that keeping
root_count elevated would make it more difficult to reclaim shadow pages from
roots that are no longer relevant to the guest.  kvm_mmu_zap_oldest_mmu_pages()
in particular would refuse to reclaim roots.  That shouldn't be problematic for
legacy shadow paging, because KVM doesn't recursively zap shadow pages.  But for
nested TDP, mmu_page_zap_pte() frees the entire tree, in the common case that
child SPTEs aren't shared across multiple trees (common in legacy shadow paging,
extremely uncommon in nested TDP).

And for the nested TDP issue, if it's actually a problem, I would *love* to
solve that problem by making KVM's forced reclaim more sophisticated.  E.g. one
idea would be to kick all vCPUs if the maximum number of pages has been reached,
have each vCPU purge old roots from prev_roots, and then reclaim unused roots.
It would be a bit more complicated than that, as KVM would need a way to ensure
forward progress, e.g. if the shadow pages limit has been reach with a single
root.  But even then, kvm_mmu_zap_oldest_mmu_pages() could be made a _lot_ smarter.

TL;DR: what if we simply bump the number of cached roots to ~16?