Re: [PATCH v10 2/3] KVM: x86: Dirty quota-based throttling of vcpus

On Wed, Feb 21, 2024, Shivam Kumar wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2d6cdeab1f8a..fa0b3853ee31 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3397,8 +3397,12 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
>  	if (!try_cmpxchg64(sptep, &old_spte, new_spte))
>  		return false;
>  
> -	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
> +	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
> +		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +		update_dirty_quota(vcpu->kvm, (1L << SPTE_LEVEL_SHIFT(sp->role.level)));
>  		mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);

Forcing KVM to manually call update_dirty_quota() whenever mark_page_dirty_in_slot()
is invoked is not maintainable, as we inevitably will forget to update the quota
and probably not notice.  We've already had bugs escape where KVM fails to mark
gfns dirty, and those flows are much more testable.
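
FWIW, the obvious alternative is to do the charging in the one function that all
dirtying already funnels through.  Completely untested, heavily simplified sketch
(it elides the dirty ring path, and the flat PAGE_SIZE charge is just a placeholder
that ignores the hugepage problem below; update_dirty_quota() is the helper from
this series):

void mark_page_dirty_in_slot(struct kvm *kvm,
			     const struct kvm_memory_slot *memslot,
			     gfn_t gfn)
{
	if (!memslot)
		return;

	/* Charge the quota here, in the common path, so arch code can't forget. */
	update_dirty_quota(kvm, PAGE_SIZE);

	if (memslot->dirty_bitmap)
		set_bit_le(gfn - memslot->base_gfn, memslot->dirty_bitmap);
}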

Stepping back, I feel like this series has gone off the rails a bit.
 
I understand Marc's objections to the uAPI not differentiating between page sizes,
but simply updating the quota based on KVM's page size is also flawed.  E.g. if
the guest is backed with 1GiB pages, odds are very good that the dirty quotas are
going to be completely out of whack due to the first vCPU that writes a given 1GiB
region being charged with the entire 1GiB page.
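
To put numbers on it, with SPTE_LEVEL_SHIFT(level) being PAGE_SHIFT + (level - 1) * 9,
the charge in the hunk above works out to:

	1L << SPTE_LEVEL_SHIFT(PG_LEVEL_4K)	/* 1 << 12 == 4KiB */
	1L << SPTE_LEVEL_SHIFT(PG_LEVEL_2M)	/* 1 << 21 == 2MiB */
	1L << SPTE_LEVEL_SHIFT(PG_LEVEL_1G)	/* 1 << 30 == 1GiB */

i.e. the first vCPU to write a 1GiB mapping eats a full GiB of quota for what may be
a single-byte store.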

And without a way to trigger detection of writes, e.g. by enabling PML or write-
protecting memory, I don't see how userspace can build anything on the "bytes
dirtied" information.

From v7[*], Marc was specifically objecting to the proposed API effectively being
presented as a general purpose API, but in reality the API was heavily reliant
on dirty logging being enabled.

 : My earlier comments still stand: the proposed API is not usable as a
 : general purpose memory-tracking API because it counts faults instead
 : of memory, making it inadequate except for the most trivial cases.
 : And I cannot believe you were serious when you mentioned that you were
 : happy to make that the API.

To avoid going in circles, I think we need to first agree on the scope of the uAPI.
Specifically, do we want to shoot for a generic write-tracking API, or do we want
something that is explicitly tied to dirty logging?


Marc,

If we figured out a clean-ish way to tie the "gfns dirtied" information to
dirty logging, i.e. didn't misconstrue the counts as generally useful data, would
that be acceptable?  While I like the idea of a generic solution, I don't see a
path to an implementation that isn't deeply flawed without basically doing dirty
logging, i.e. without forcing the use of non-huge pages and write-protecting memory
to intercept "new" writes based on input from userspace.
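
E.g. if the quota were charged in mark_page_dirty_in_slot() as sketched above, tying
it to dirty logging could be as simple as gating the charge on the memslot flag
(again untested, just to illustrate the shape):

	/* Charge the quota only for slots userspace is actually dirty logging. */
	if (kvm_slot_dirty_track_enabled(memslot))
		update_dirty_quota(kvm, PAGE_SIZE);

Since dirty logging already forces non-huge mappings and uses write-protection (or
PML) to intercept writes, "one 4KiB charge per logged write" would at least be
well-defined.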

[*] https://lore.kernel.org/all/20221113170507.208810-2-shivam.kumar1@xxxxxxxxxxx



