Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 1 Dec 2021 23:37:40 +0000

On Wed, Dec 01, 2021, David Matlack wrote:
> On Wed, Dec 1, 2021 at 10:29 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > Hmm, in this particular case, I think using the caches is the wrong approach.  The
> > behavior of pre-filling the caches makes sense for vCPUs because faults may need
> > multiple objects and filling the cache ensures the entire fault can be handled
> > without dropping mmu_lock.  And any extra/unused objects can be used by future
> > faults.  For page splitting, neither of those really holds true.  If there are a
> > lot of pages to split, KVM will have to drop mmu_lock to refill the cache.  And if
> > there are few pages to split, or the caches are refilled toward the end of the walk,
> > KVM may end up with a pile of unused objects it needs to free.
> >
> > Since this code already needs to handle failure, and more importantly, it's a
> > best-effort optimization, I think trying to use the caches is a square peg, round
> > hole scenario.
> >
> > Rather than use the caches, we could do allocation 100% on-demand and never drop
> > mmu_lock to do allocation.  The one caveat is that direct reclaim would need to be
> > disallowed so that the allocation won't sleep.  That would mean that eager splitting
> > would fail under heavy memory pressure when it otherwise might succeed by reclaiming.
> > That would mean vCPUs get penalized as they'd need to do the splitting on fault and
> > potentially do direct reclaim as well.  It's not obvious that that would be a problem
> > in practice, e.g. the vCPU is probably already seeing a fair amount of disruption due
> > to memory pressure, and slowing down vCPUs might alleviate some of that pressure.
> 
> Not necessarily. The vCPUs might be running just fine in the VM being
> split because they are in their steady state and not faulting in any
> new memory. (Memory pressure might be coming from another VM landing
> on the host.)

Hrm, true.

> IMO, if we have an opportunity to avoid doing direct reclaim in the
> critical path of customer execution we should take it.
>
> 
> The on-demand approach will also increase the amount of time we have
> to hold the MMU lock for page splitting. This is not too terrible for
> the TDP MMU since we are holding the MMU lock in read mode, but is
> going to become a problem when we add page splitting support for the
> shadow MMU.
> 
> I do agree that the caches approach, as implemented, will inevitably
> end up with a pile of unused objects at the end that need to be freed.
> I'd be happy to take a look and see if there's any way to reduce the
> number of unused objects at the end with a bit smarter top-up logic.

It's not just the extra objects, it's the overall complexity that bothers me.
Complexity isn't really the correct word, it's more that, as written, the logic
is spread over several files and is disingenuous from the perspective that the
split_cache is in kvm->arch, which implies persistence, but the caches are
completely torn down after every memslot split.

I suspect part of the problem is that the code is trying to plan for a future
where nested MMUs also support splitting large pages.  Usually I'm all for that
sort of thing, but in this case it creates a lot of APIs that should not exist,
either because the function is not needed at all, or because it's a helper buried
in tdp_mmu.c.  E.g. assert_split_caches_invariants() is overkill.

That's solvable by refactoring and shuffling code, but using kvm_mmu_memory_cache
still feels wrong.  The caches don't fully solve the might_sleep() problem since
the loop still has to drop mmu_lock purely because it needs to allocate memory,
and at the same time the caches are too aggressive because we can theoretically get
false positives on OOM scenarios, e.g. a topup could fail when trying to allocate
25 objects, when only 1 is needed.  We could enhance the cache code, which is
pretty rudimentary, but it still feels forced.
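
To make the false-positive OOM concrete, here is a minimal userspace sketch.
It is not KVM code: toy_allocator, toy_cache, toy_alloc_one() and
toy_cache_topup() are hypothetical stand-ins, and the minimum of 25 is just
the number from the example above.  It only assumes that a topup reports
failure when it cannot reach its minimum.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the page allocator: `avail` objects are left. */
struct toy_allocator {
	int avail;
};

/* Hypothetical stand-in for a pre-filled object cache. */
struct toy_cache {
	int nobjs;
};

/* Allocate a single object; fails only when nothing at all is left. */
static int toy_alloc_one(struct toy_allocator *a)
{
	if (a->avail == 0)
		return -ENOMEM;
	a->avail--;
	return 0;
}

/*
 * Fill the cache until it holds at least `min` objects.  Reports -ENOMEM
 * if the allocator runs dry first, even when the caller needs one object.
 */
static int toy_cache_topup(struct toy_cache *c, struct toy_allocator *a, int min)
{
	while (c->nobjs < min) {
		if (toy_alloc_one(a))
			return -ENOMEM;
		c->nobjs++;
	}
	return 0;
}
```

With three objects available, toy_cache_topup(&c, &a, 25) fails, whereas a
single toy_alloc_one() against the same budget succeeds.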

One thing we can take advantage of is that remote TLB flushes can be deferred
until after all roots are done, and don't need to be serviced if mmu_lock is
dropped.  Changing from a hugepage to a collection of smaller pages is atomic, no
memory is freed, and there are no changes in gfn=>pfn made by the split.  If
something else comes along and modifies the newly created sp or its children,
then it will flush accordingly.  Similar to write-protecting the page, the only
requirement is that all vCPUs see the small pages before the ioctl() returns,
i.e. before userspace can query the dirty log.  Never needing to flush is one
less reason to use a variant of tdp_mmu_iter_cond_resched(). 
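
As a rough userspace model of that flush policy (none of this is KVM code,
all toy_* names and the per-root counters are made up for illustration):
the per-root walk only records that a flush is owed through an in/out
param, yielding never flushes, and the caller flushes once after the last
root.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_ROOTS 3

/* Hypothetical counters so the policy can be checked without hardware. */
struct toy_vm {
	int hugepages[TOY_NR_ROOTS];	/* huge mappings left to split per root */
	int flushes;			/* remote TLB flushes issued */
	int yields;			/* times the lock was dropped mid-walk */
};

static void toy_flush_remote_tlbs(struct toy_vm *vm)
{
	vm->flushes++;
}

/*
 * Split every huge mapping under one root.  Yielding after every `batch`
 * splits does not flush; the walk only notes via *flush that one is owed.
 */
static void toy_split_root(struct toy_vm *vm, int root, int batch, bool *flush)
{
	int done = 0;

	while (vm->hugepages[root] > 0) {
		vm->hugepages[root]--;
		*flush = true;
		if (++done % batch == 0)
			vm->yields++;	/* lock dropped, no flush needed */
	}
}

/* Walk all roots, then flush at most once before returning. */
static void toy_split_all(struct toy_vm *vm, int batch)
{
	bool flush = false;
	int i;

	for (i = 0; i < TOY_NR_ROOTS; i++)
		toy_split_root(vm, i, batch, &flush);

	if (flush)
		toy_flush_remote_tlbs(vm);
}
```

However many times the walk yields, toy_split_all() issues a single flush,
and none at all if nothing was split.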

So, what if we do something like this?  Try to allocate on-demand without dropping
mmu_lock.  In the happy case, it will succeed and there's no need to drop mmu_lock.
If allocation fails, drop RCU and mmu_lock and retry with direct reclaim allowed.

Some ugly gotos to reduce indentation, there's probably a better way to dress
this up.  Comments obviously needed.  This also doesn't track whether or not a
flush is needed, that will sadly need to be an in/out param, assuming we want to
return success/failure.

static struct kvm_mmu_page *tdp_mmu_alloc_sp(gfp_t allow_direct_reclaim)
{
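	/*
	 * GFP_KERNEL_ACCOUNT already includes __GFP_DIRECT_RECLAIM, so mask it
	 * off and let the caller opt back in.
	 */
	gfp_t gfp = ((GFP_KERNEL_ACCOUNT | __GFP_ZERO) & ~__GFP_DIRECT_RECLAIM) |
		    allow_direct_reclaim;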
	struct kvm_mmu_page *sp;
	u64 *spt;

	spt = (void *)__get_free_page(gfp);
	if (!spt)
		return NULL;

	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
	if (!sp) {
		free_page((unsigned long)spt);
		return NULL;
	}

	sp->spt = spt;

	return sp;
}

static int tdp_mmu_split_large_pages(struct kvm *kvm, struct kvm_mmu_page *root,
				     gfn_t start, gfn_t end, int target_level)
{
	struct kvm_mmu_page *sp = NULL;
	struct tdp_iter iter;

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
			continue;

		if (likely(sp))
			goto do_split;

		sp = tdp_mmu_alloc_sp(0);
		if (!sp) {
			rcu_read_unlock();
			read_unlock(&kvm->mmu_lock);

			sp = tdp_mmu_alloc_sp(__GFP_DIRECT_RECLAIM);

			read_lock(&kvm->mmu_lock);

			if (!sp)
				return -ENOMEM;

			rcu_read_lock();
			tdp_iter_restart(&iter);
			continue;
		}

do_split:
		init_tdp_mmu_page(sp, iter.gfn, get_child_page_role(&iter));

		if (!tdp_mmu_split_large_page(kvm, &iter, sp))
			goto retry;

		sp = NULL;
	}

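	rcu_read_unlock();

	/* Free the spare page if the walk ended before it was consumed. */
	if (sp) {
		free_page((unsigned long)sp->spt);
		kmem_cache_free(mmu_page_header_cache, sp);
	}

	return 0;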
}
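
For anyone who wants to poke at the control flow outside the kernel, below is
a rough userspace model of the same fast-path/slow-path idea.  Everything in
it is hypothetical (toy_lock, toy_pool, toy_alloc(), toy_split()); malloc()
stands in for the page allocator, and nowait_budget is an invented knob for
how many allocations succeed without reclaim.  It only counts how often the
lock would be dropped.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical lock model: just tracks state and how often it was dropped. */
struct toy_lock {
	bool held;
	int drops;
};

/* Hypothetical pool: number of allocations that succeed without blocking. */
struct toy_pool {
	int nowait_budget;
};

/* Non-blocking attempts fail once the budget is gone; blocking ones don't. */
static void *toy_alloc(struct toy_pool *p, bool may_block)
{
	if (!may_block) {
		if (p->nowait_budget == 0)
			return NULL;
		p->nowait_budget--;
	}
	return malloc(64);
}

/* Split nr_huge entries; the caller is assumed to hold the lock. */
static int toy_split(struct toy_lock *lock, struct toy_pool *pool, int nr_huge)
{
	void *obj = NULL;
	int i = 0;

	while (i < nr_huge) {
		if (!obj) {
			/* Fast path: allocate without dropping the lock. */
			obj = toy_alloc(pool, false);
			if (!obj) {
				/* Slow path: drop the lock, allow blocking. */
				lock->held = false;
				lock->drops++;
				obj = toy_alloc(pool, true);
				lock->held = true;
				if (!obj)
					return -ENOMEM;
				continue;	/* revisit entry i with obj in hand */
			}
		}
		free(obj);	/* stand-in for installing the new page table */
		obj = NULL;
		i++;
	}

	free(obj);	/* release a spare object, if any; free(NULL) is a no-op */
	return 0;
}
```

When the budget covers every split, the lock is never dropped; once it runs
out, each remaining split costs one drop.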