Re: [PATCH] KVM: x86/mmu: Allow per VM kvm_mmu_max_gfn()

"Edgecombe, Rick P" <rick.p.edgecombe@xxxxxxxxx> · Tue, 16 Jul 2024 21:14:06 +0000

On Tue, 2024-07-16 at 12:31 -0700, Sean Christopherson wrote:
> 
> No, it most definitely is not more correct.  There is absolutely no issue
> zapping
> SPTEs that should never exist.  In fact, restricting the zapping path is far
> more
> likely to *cause* correctness issues, e.g. see 
> 
>   524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all
> SPTEs")
>   86931ff7207b ("KVM: x86/mmu: Do not create SPTEs for GFNs that exceed
> host.MAXPHYADDR")

The type of correctness this was going for was around the new treatment of GFNs
not having the shared/alias bit. As you know it can get confusing which
variables have these bits and which have them stripped. Part of the recent MMU
work involved making sure at least GFN's didn't contain the shared bit.

Then in TDP MMU where it iterates from start to end, for example:
static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
			      gfn_t start, gfn_t end, bool can_yield, bool
flush)
{
	struct tdp_iter iter;

	end = min(end, tdp_mmu_max_gfn_exclusive());

	lockdep_assert_held_write(&kvm->mmu_lock);

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
		if (can_yield &&
		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
			flush = false;
			continue;
		}
...

The math gets a bit confused. For the private/mirror root, start will begin at
0, and iterate to a range that includes the shared bit. No functional problem
because we are zapping things that shouldn't be set. But it means the 'gfn' has
the bit position of the shared bit set. Although it is not acting as the shared
bit in this case, just an out of range bit.

For the shared/direct root, it will iterate from (shared_bit | 0) to (shared_bit
| max_gfn). So where the mirror root iterates through the whole range, the
shared case skips it in the current code anyway.

And then the fact that the code already takes care here to avoid zapping over
ranges that exceed the max gfn.

So it's a bit asymmetric, and just overall weird. We are weighing functional
correctness risk with known code weirdness. My inclination was to try to reduce
the places where TDX MMU needs paths happen to work for subtle reasons for the
cost of the VM field. I think I still lean that way, but not a strong opinion.

> 
> Creating SPTEs is a different matter, but unless I'm missing something, the
> only
> path that _needs_ to be updated is kvm_arch_prepare_memory_region(), to
> disallow
> aliased memslots.
> 
> I assume TDX will strip the shared bit from fault->gfn, and shove it back in
> when
> creating MMIO SPTEs in the shared EPT page tables.
> 
> Why can't we simply do:

Seems ok. It's not dissimilar to what this patch does for memslots. For the
fault path check, it is catching something that 86931ff7207b ("KVM: x86/mmu: Do
not create SPTEs for GFNs that exceed host.MAXPHYADDR") says not to worry about.
But maybe that calculus changes with the TDX module in the loop.

We also should be ok functionally without either.