On 03/05/22 3:44 am, Peter Xu wrote:
Hi, Shivam,
On Sun, Mar 06, 2022 at 10:08:48PM +0000, Shivam Kumar wrote:
+static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
+{
+	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
+	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
+	struct kvm_run *run = vcpu->run;
+
+	if (!dirty_quota || (pages_dirtied < dirty_quota))
+		return 1;
+
+	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
+	run->dirty_quota_exit.count = pages_dirtied;
+	run->dirty_quota_exit.quota = dirty_quota;
Pure question: why does this need to be returned to userspace? Is this value set from userspace?
1) The quota needs to be replenished once exhausted.
2) The vcpu should be made to sleep if it has consumed its quota too quickly.
Both these actions are performed on the userspace side, where we expect a thread recalculating the quota at very small regular intervals based on network bandwidth information. This enables us to micro-stun the vcpus (steal their runtime just at the moment they are dirtying heavily).
We have implemented a "common quota" approach, i.e. transferring any unused quota to a common pool so that it can be consumed by any vcpu in the next interval on an FCFS basis.
It seemed best to implement all this logic on the userspace side and keep only the dirty count and the logic to exit to userspace whenever the vcpu exhausts its quota on the kernel side. The count is needed in userspace because there are cases where a vcpu can actually dirty more than its quota (e.g. if PML is enabled), so this information can be used to re-adjust the next quotas. A rough sketch of the userspace handling follows the rest of the quoted hunk below.
Thank you for the question. Please let me know if you have further concerns.
+	return 0;
+}
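To make this concrete, here is a rough sketch of what the userspace side could look like. The names (handle_dirty_quota_exhausted(), common_pool, interval_quota, the 1 ms back-off) are illustrative only and not part of this series, and the dirty_quota/dirty_quota_exit fields of struct kvm_run assume the uapi changes from this patch:

#include <pthread.h>
#include <stdint.h>
#include <time.h>
#include <linux/kvm.h>	/* needs the uapi changes from this series */

static pthread_mutex_t dirty_quota_mutex = PTHREAD_MUTEX_INITIALIZER;
static uint64_t common_pool;	/* unused quota pooled across vcpus */
static uint64_t interval_quota;	/* per-vcpu quota for one interval  */

static void handle_dirty_quota_exhausted(struct kvm_run *run)
{
	uint64_t grant = 0;

	pthread_mutex_lock(&dirty_quota_mutex);
	/* First try to borrow any unused quota from the common pool (FCFS). */
	if (common_pool) {
		grant = common_pool;
		common_pool = 0;
	}
	pthread_mutex_unlock(&dirty_quota_mutex);

	if (!grant) {
		/*
		 * Nothing left in this interval: micro-stun the vcpu by
		 * making it sleep for the rest of the interval.
		 */
		struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };

		nanosleep(&ts, NULL);
		/* Start the next interval with a fresh per-vcpu quota. */
		grant = interval_quota;
	}

	/* Replenish: allow 'grant' more dirtied pages before the next exit. */
	run->dirty_quota = run->dirty_quota_exit.count + grant;
}

The vcpu thread would call this whenever KVM_RUN returns with exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, while a separate throttle thread recomputes interval_quota and refills common_pool every interval.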
The other high level question is whether you have considered using the ring
full event to achieve similar goal?
Right now the KVM_EXIT_DIRTY_RING_FULL event is generated when the per-vcpu ring gets full. I think there's a problem in that the ring size cannot be set arbitrarily but must be a power of 2. There is also a maximum allowed ring size.
However, since the ring size can be fairly small (e.g. 4096 entries) it can still achieve some kind of accuracy. For example, userspace can quickly kick the vcpu back into KVM_RUN until it sees that some quota has been reached (and actually that's how dirty-limit is implemented in QEMU, contributed by China Telecom):
https://lore.kernel.org/qemu-devel/cover.1646243252.git.huangy81@chinatelecom.cn/
Is there perhaps some explicit reason that dirty ring cannot be used?
Thanks!
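Just to make sure I follow the suggestion, my reading of the ring-full based throttling is roughly the loop below, where harvest_dirty_ring(), current_quota() and wait_for_next_interval() are placeholder helpers rather than real QEMU or KVM interfaces:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder helpers, not actual QEMU/KVM APIs: */
uint64_t harvest_dirty_ring(int vcpu_fd);	/* collect + reset ring entries */
uint64_t current_quota(void);			/* quota for the current interval */
void wait_for_next_interval(void);		/* sleep until the quota is refilled */

static void vcpu_run_loop(int vcpu_fd, struct kvm_run *run)
{
	uint64_t dirtied_this_interval = 0;

	for (;;) {
		ioctl(vcpu_fd, KVM_RUN, 0);

		if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
			/* Count the harvested entries against a software quota. */
			dirtied_this_interval += harvest_dirty_ring(vcpu_fd);

			/* Stall only once the quota for this interval is used up. */
			while (dirtied_this_interval >= current_quota()) {
				wait_for_next_interval();
				dirtied_this_interval = 0;
			}

			continue;	/* otherwise kick the vcpu straight back in */
		}

		/* ... handle other exit reasons ... */
	}
}

With that picture in mind, here is why we went a different way: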
When we started this series, AFAIK it was not possible to change the dirty ring size once the vcpus were created, so we could not set it dynamically. Also, since we are going for micro-stunning and the number of dirties allowed in such small intervals can be pretty low, being restricted to a dirty quota that is a power of 2 can cause issues. For instance, if the dirty quota should be 9, we can only set it to 16 (if we round up), and if it should be 15, we can only set it to 8 (if we round down). I hope you'd agree that this can make a huge difference.
Also, this approach works with both the dirty bitmap and the dirty ring interface, which can help in extending this solution to other architectures.
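A toy example of how far the power-of-2 restriction can land from the computed quota (round_up_pow2() is just for illustration):

#include <stdint.h>
#include <stdio.h>

/* Smallest power of 2 >= x. */
static uint64_t round_up_pow2(uint64_t x)
{
	uint64_t p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

int main(void)
{
	uint64_t wanted[] = { 9, 15 };

	for (int i = 0; i < 2; i++) {
		uint64_t up = round_up_pow2(wanted[i]);
		uint64_t down = up >> 1;

		printf("wanted quota %2llu -> ring allows %2llu (up) or %2llu (down)\n",
		       (unsigned long long)wanted[i],
		       (unsigned long long)up,
		       (unsigned long long)down);
	}
	return 0;
}

At quotas this small, either choice is roughly a factor of two away from what the throttle thread computed.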
I'm very grateful for the questions. Looking forward to more feedback.
Thanks.