On Sat, 21 May 2022 21:29:36 +0100,
Shivam Kumar <shivam.kumar1@xxxxxxxxxxx> wrote:
> 
> Define variables to track and throttle memory dirtying for every vcpu.
> 
> dirty_count:    Number of pages the vcpu has dirtied since its creation,
>                 while dirty logging is enabled.
> dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
>                 more, it needs to request more quota by exiting to
>                 userspace.
> 
> Implement the flow for throttling based on dirty quota.
> 
> i) Increment dirty_count for the vcpu whenever it dirties a page.
> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
> count equals/exceeds dirty quota) to request more dirty quota.
> 
> Suggested-by: Shaju Abraham <shaju.abraham@xxxxxxxxxxx>
> Suggested-by: Manish Mishra <manish.mishra@xxxxxxxxxxx>
> Co-developed-by: Anurag Madnawat <anurag.madnawat@xxxxxxxxxxx>
> Signed-off-by: Anurag Madnawat <anurag.madnawat@xxxxxxxxxxx>
> Signed-off-by: Shivam Kumar <shivam.kumar1@xxxxxxxxxxx>
> ---
>  Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/spte.c        |  4 ++--
>  arch/x86/kvm/vmx/vmx.c         |  3 +++
>  arch/x86/kvm/x86.c             |  4 ++++

Please split the x86 support into its own patch.

>  include/linux/kvm_host.h       | 15 +++++++++++++++
>  include/linux/kvm_types.h      |  1 +
>  include/uapi/linux/kvm.h       | 12 ++++++++++++
>  virt/kvm/kvm_main.c            |  7 ++++++-
>  8 files changed, 75 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9f3172376ec3..a9317ed31d06 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6125,6 +6125,24 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
> 
> +::
> +
> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> +		struct {
> +			__u64 count;
> +			__u64 quota;
> +		} dirty_quota_exit;
> +
> +If the exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the
> +VCPU has exhausted its dirty quota. The 'dirty_quota_exit' member of the
> +kvm_run structure makes the following information available to the userspace:
> +	'count' field: the current count of pages dirtied by the VCPU, which can
> +		       be skewed based on the size of the pages accessed by each
> +		       vCPU.
> +	'quota' field: the observed dirty quota just before the exit to userspace.
> +The userspace can design a strategy to allocate the overall scope of dirtying
> +for the VM among the vcpus. Based on the strategy and the current state of dirty
> +quota throttling, the userspace can make a decision to either update (increase)
> +the quota or to put the VCPU to sleep for some time.
> +
>  ::
> 
>   /* Fix the size of the union. */
> @@ -6159,6 +6177,20 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
> 
>  ::
> 
> +	/*
> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the
> +	 * quota is reached/exceeded.
> +	 */
> +	__u64 dirty_quota;
> +
> +Please note that enforcing the quota is best effort, as the guest may dirty
> +multiple pages before KVM can recheck the quota. However, unless KVM is using
> +a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
> +KVM will detect quota exhaustion within a handful of dirtied pages. If a
> +hardware ring buffer is used, the overrun is bounded by the size of the buffer
> +(512 entries for PML).
> +
> +::
>  };
> 
> 
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 73cfe62fdad1..01f0d2a04796 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -182,9 +182,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
>  		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> 
> -	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
> +	if (spte & PT_WRITABLE_MASK) {
>  		/* Enforced by kvm_mmu_hugepage_adjust. */
> -		WARN_ON(level > PG_LEVEL_4K);
> +		WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot));
>  		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
>  	}
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b730d799c26e..5cbe4992692a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5507,6 +5507,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
>  		 */
>  		if (__xfer_to_guest_mode_work_pending())
>  			return 1;
> +
> +		if (!kvm_vcpu_check_dirty_quota(vcpu))
> +			return 0;
>  	}
> 
>  	return 1;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eb4029660bd9..0b35b8cc0274 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10257,6 +10257,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  	vcpu->arch.l1tf_flush_l1d = true;
> 
>  	for (;;) {
> +		r = kvm_vcpu_check_dirty_quota(vcpu);

I really wonder why this has to be checked on each and every run. Why
isn't this a request set by the core code instead?
> +		if (!r)
> +			break;
> +
>  		if (kvm_vcpu_running(vcpu)) {
>  			r = vcpu_enter_guest(vcpu);
>  		} else {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f11039944c08..ca1ac970a6cf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -530,6 +530,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>  	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>  }
> 
> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)

Why is this inline instead of a normal call?

> +{
> +	struct kvm_run *run = vcpu->run;
> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> +
> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> +		return 1;

What happens when pages_dirtied becomes large and dirty_quota has to
wrap to allow further progress?

> +
> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> +	run->dirty_quota_exit.count = pages_dirtied;
> +	run->dirty_quota_exit.quota = dirty_quota;
> +	return 0;
> +}
> +
>  /*
>   * Some of the bitops functions do not support too long bitmaps.
>   * This number must be determined not to exceed such limits.
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index dceac12c1ce5..7f42486b0405 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -106,6 +106,7 @@ struct kvm_vcpu_stat_generic {
>  	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
>  	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
>  	u64 blocking;
> +	u64 pages_dirtied;
>  };
> 
>  #define KVM_STATS_NAME_SIZE	48
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 507ee1f2aa96..1d9531efe1fb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -270,6 +270,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_X86_BUS_LOCK     33
>  #define KVM_EXIT_XEN              34
>  #define KVM_EXIT_RISCV_SBI        35
> +#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 36
> 
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -487,6 +488,11 @@ struct kvm_run {
>  			unsigned long args[6];
>  			unsigned long ret[2];
>  		} riscv_sbi;
> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> +		struct {
> +			__u64 count;
> +			__u64 quota;
> +		} dirty_quota_exit;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -508,6 +514,12 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +	/*
> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the
> +	 * quota is reached/exceeded.
> +	 */
> +	__u64 dirty_quota;

How is the feature detected from userspace? I don't see how it can
make use of it without knowing about it in the first place.
> };
> 
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0afc016cc54d..041ab464405d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3163,7 +3163,12 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>  		return;
>  #endif
> 
> -	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> +	if (!memslot)
> +		return;
> +
> +	vcpu->stat.generic.pages_dirtied++;
> +
> +	if (kvm_slot_dirty_track_enabled(memslot)) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
>  		u32 slot = (memslot->as_id << 16) | memslot->id;
> 

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.