On Tue, Oct 29, 2024, Yong He wrote: > From: Yong He <alexyonghe@xxxxxxxxxxx> > > Introduce prev_roots_num param, so that we use more cache of > previous CR3/root_hpa pairs, which help us to reduce shadow > page table evict and rebuild overhead. > > Signed-off-by: Yong He <alexyonghe@xxxxxxxxxxx> > --- ... > +uint __read_mostly prev_roots_num = KVM_MMU_NUM_PREV_ROOTS; > +EXPORT_SYMBOL_GPL(prev_roots_num); > +module_param_cb(prev_roots_num, &prev_roots_num_ops, > + &prev_roots_num, 0644); Allowing the variable to be changed while KVM is running is unsafe. I also think a module param is the wrong way to try to allow for bigger caches. The caches themselves are relatively cheap, at 16 bytes per entry. And I doubt the cost of searching a larger cache in fast_pgd_switch() would have a measurable impact, since the most recently used roots will be at the front of the cache, i.e. only near-misses and misses will be affected. The only potential downside to larger caches I can think of, is that keeping root_count elevated would make it more difficult to reclaim shadow pages from roots that are no longer relevant to the guest. kvm_mmu_zap_oldest_mmu_pages() in particular would refuse to reclaim roots. That shouldn't be problematic for legacy shadow paging, because KVM doesn't recursively zap shadow pages. But for nested TDP, mmu_page_zap_pte() frees the entire tree, in the common case that child SPTEs aren't shared across multiple trees (common in legacy shadow paging, extremely uncommon in nested TDP). And for the nested TDP issue, if it's actually a problem, I would *love* to solve that problem by making KVM's forced reclaim more sophisticated. E.g. one idea would be to kick all vCPUs if the maximum number of pages has been reached, have each vCPU purge old roots from prev_roots, and then reclaim unused roots. It would be a bit more complicated than that, as KVM would need a way to ensure forward progress, e.g. if the shadow pages limit has been reach with a single root. But even then, kvm_mmu_zap_oldest_mmu_pages() could be made a _lot_ smarter. TL;DR: what if we simply bump the number of cached roots to ~16?