Re: [PATCH 06/23] KVM: MMU: load new PGD once nested two-dimensional paging is initialized

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Wed, 9 Feb 2022 13:34:20 +0100

On 2/4/22 20:18, David Matlack wrote:
On Fri, Feb 04, 2022 at 06:57:01AM -0500, Paolo Bonzini wrote:
__kvm_mmu_new_pgd looks at the MMU's root_level and shadow_root_level
via fast_pgd_switch.

Those checks are just for performance correct (to skip iterating through
the list of roots)?

Either way, it's probably worth including a Fixes tag below.

It's actually a much bigger mess---though it's working as intended, 
except that in some cases the caching is suboptimal.

Basically, you have to call __kvm_mmu_new_pgd *before* kvm_init_mmu 
because of how fast_pgd_switch takes care of stashing the current root 
in the cache.   A PAE root is never placed in the cache exactly because 
mmu->root_level and mmu->shadow_root_level hold the value for the 
_current_ root.  In that case, fast_pgd_switch does not look for a 
cached PGD (even if according to the role it could be there).

__kvm_mmu_new_pgd then frees the current root using kvm_mmu_free_roots. 
 kvm_mmu_free_roots *also* uses mmu->root_level and 
mmu->shadow_root_level to distinguish whether the page table uses a 
single root or 4 PAE roots.  Because kvm_init_mmu can overwrite 
muu->root_level, kvm_mmu_free_roots must also be called before kvm_init_mmu.

I do wonder if the use of mmu->shadow_root_level was intentional (it 
certainly escaped me when first reviewing fast PGD switching), but 
fortunately none of this is necessary.  PAE roots can be identified from 
!to_shadow_page(root_hpa), so the better fix is to do that:

- in fast_pgd_switch/cached_root_available, you need to account for two 
possible transformations of the cache: either the old entry becomes the 
MRU of the prev_roots as in the current code, or the old entry cannot be 
cached.  This is 20-30 more lines of code.

- in kvm_mmu_free_roots, just use to_shadow_page instead of 
mmu->root_level and mmu->shadow_root_level.

Once this is in place, the original bug is fixed simply by calling 
kvm_mmu_new_pgd after kvm_init_mmu.  kvm_mmu_reset_context is not doing 
now to avoid having to figure out the new role, but it can do that 
easily after the above change.

I'll keep this cpu_role refactoring around, but strictly speaking it 
becomes a separate change than the optimization.

Paolo