On 04/05/22 10:56 pm, Peter Xu wrote:
On Wed, May 04, 2022 at 12:03:57PM +0530, Shivam Kumar wrote:
On 03/05/22 7:13 pm, Peter Xu wrote:
On Tue, May 03, 2022 at 12:52:26PM +0530, Shivam Kumar wrote:
On 03/05/22 3:44 am, Peter Xu wrote:
Hi, Shivam,
On Sun, Mar 06, 2022 at 10:08:48PM +0000, Shivam Kumar wrote:
+static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
+{
+ u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
+ u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
+ struct kvm_run *run = vcpu->run;
+
+ if (!dirty_quota || (pages_dirtied < dirty_quota))
+ return 1;
+
+ run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
+ run->dirty_quota_exit.count = pages_dirtied;
+ run->dirty_quota_exit.quota = dirty_quota;
Pure question: why this needs to be returned to userspace? Is this value
set from userspace?
1) The quota needs to be replenished once exhasuted.
2) The vcpu should be made to sleep if it has consumed its quota pretty
quick.
Both these actions are performed on the userspace side, where we expect a
thread calculating the quota at very small regular intervals based on
network bandwith information. This can enable us to micro-stun the vcpus
(steal their runtime just the moment they were dirtying heavily).
We have implemented a "common quota" approach, i.e. transfering any unused
quota to a common pool so that it can be consumed by any vcpu in the next
interval on FCFS basis.
It seemed fit to implement all this logic on the userspace side and just
keep the "dirty count" and the "logic to exit to userspace whenever the vcpu
has consumed its quota" on the kernel side. The count is required on the
userspace side because there are cases where a vcpu can actually dirty more
than its quota (e.g. if PML is enabled). Hence, this information can be
useful on the userspace side and can be used to re-adjust the next quotas.
I agree this information is useful. Though my question was that if the
userspace should have a copy (per-vcpu) of that already then it's not
needed to pass it over to it anymore?
This is how we started but then based on the feedback from Sean, we moved
'pages_dirtied' to vcpu stats as it can be a useful stat. The 'dirty_quota'
variable is already shared with userspace as it is in the vcpu run struct
and hence the quota can be modified by userspace on the go. So, it made
sense to pass both the variables at the time of exit (the vcpu might be
exiting with an old copy of dirty quota, which the userspace needs to know).
Correct.
My point was the userspace could simply cache the old quota too in the
userspace vcpu struct even if there's the real quota value in vcpu run.
No strong opinion, but normally if this info is optional to userspace I
think it's cleaner to not pass it over again to keep the kernel ABI simple.
Ack. Though this implementation path aimed at not reserving extra memory
for old values of dirty quota and making the solution robust to multiple
changes (multiple old versions) in dirty quota during dirty quota exit,
which is rare. There was no such strong reason.
Thanks.
Thank you for the question. Please let me know if you have further concerns.
+ return 0;
+}
The other high level question is whether you have considered using the ring
full event to achieve similar goal?
Right now KVM_EXIT_DIRTY_RING_FULL event is generated when per-vcpu ring
gets full. I think there's a problem that the ring size can not be
randomly set but must be a power of 2. Also, there is a maximum size of
ring allowed at least.
However since the ring size can be fairly small (e.g. 4096 entries) it can
still achieve some kind of accuracy. For example, the userspace can
quickly kick the vcpu back to VM_RUN only until it sees that it reaches
some quota (and actually that's how dirty-limit is implemented on QEMU,
contributed by China Telecom):
https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_qemu-2Ddevel_cover.1646243252.git.huangy81-40chinatelecom.cn_&d=DwIBaQ&c=s883GpUCOChKOHiocYtGcg&r=4hVFP4-J13xyn-OcN0apTCh8iKZRosf5OJTQePXBMB8&m=y6cIruIsp50rH6ImgUi28etki9RTCTHLhRic4IzAtLa62j9PqDMsKGmePy8wGIy8&s=tAZZzTjg74UGxGVzhlREaLYpxBpsDaNV4X_DNdOcUJ8&e=
Is there perhaps some explicit reason that dirty ring cannot be used?
Thanks!
When we started this series, AFAIK it was not possible to set the dirty ring
size once the vcpus are created. So, we couldn't dynamically set dirty ring
size.
Agreed. The ring size can only be set when startup and can't be changed.
Also, since we are going for micro-stunning and the allowed dirties in
such small intervals can be pretty low, it can cause issues if we can
only use a dirty quota which is a power of 2. For instance, if the dirty
quota is to be set to 9, we can only set it to 16 (if we round up) and if
dirty quota is to be set to 15 we can only set it to 8 (if we round
down). I hope you'd agree that this can make a huge difference.
Yes. As discussed above, I didn't expect the ring size to be the quota
per-se, so what I'm wondering is whether we can leverage a small and
constant sized ring to emulate the behavior of a quota with any size, but
with a minimum granule of the dirty ring size.
This would be an interesting thing to try. I've already planned efforts to
optimise it for dirty ring interface. Thank you for this suggestion.
Side question: Is there any plan to make it possible to dynamically update
the dirty ring size?
No plan that I'm aware of..
After I checked: kvm_dirty_ring_get_rsvd_entries() limits our current ring
size, which is hardware dependent on PML. It seems at least 1024 will
still be a work-for-all case, but not sure how it'll work in reality since
then soft_limit of the dirty ring will be relatively small so it'll kick to
userspace more often. Maybe that's not a huge problem for a throttle
scenario. In that case the granule will be <=4MB if it'll work.
Ack. Thanks.
Also, this approach works for both dirty bitmap and dirty ring interface
which can help in extending this solution to other architectures.
Is there any specific arch that you're interested outside x86?
x86 is the first priority but this patchset targets s390 and arm as well.
I see.
Logically we can also think about extending dirty ring to other archs, but
there were indeed challenges where some pages can be dirtied without a vcpu
context, and that's why it was only supported initially on x86.
This is an interesting problem and we are aware of it. We have a couple of
ideas but they are very raw as of now.
I think a default (no-vcpu) ring will solve most of the issues, but that
requires some thoughts, e.g. the major difference between ring and bitmap
is that ring can be full while bitmap cannot.. We need to think careful on
that when it comes.
The other thing is IIRC s390 has been using dirty bits on the pgtables
(either radix or hash based) to trap dirty, so that'll be challenging too
when connected with a ring interface because it could make the whole thing
much less helpful..
So from that pov I think your solution sounds reasonable on that it
decouples the feature with the interface, and it also looks simple.
Ack. Thanks.
I think it should not be a problem for the quota solution, because it's
backed up by the dirty bitmap so no dirty page will be overlooked for
migration purpose, which is definitely a benefit. But I'm still curious
whether you looked into any specific archs already (x86 doesn't have such
problem) so that maybe there's some quota you still want to apply elsewhere
where there's no vcpu context.
Yes, this is kind of similar to one of the ideas we have thought. Though,
there are many things which need a lot of brainstorming, e.g. the ratio in
which we can split the overall quota to accomodate for dirties with no vcpu
context.
I'm slightly worried it'll make things even more complicated.
Only until we're thinking seriously on non-x86 platforms (since again x86
doesn't have this issue afaict..): I think one thing we could do is to dig
out all these cases and think about whether they do need any quota at all.
IOW, whether the no-vcpu dirty context are prone to have VM live migration
converge issue. If the answer is no, IMHO a simpler approach is we can
ignore those dirty pages for quota purpose.
Yes, we are running some experiments to identify such cases where enough
dirty can happen without vcpu context to make migration not converge.
I think that's a major benefit of your approach comparing to the full dirty
ring approach, because you're always backed by the dirty bitmap so there's
no way of data loss. Dirty ring's one major challenge is how to make sure
all dirty PFNs get recorded and meanwhile we don't interrupt kvm workflow
in general.
One thing I'd really appreciate is if this solution would be accepted from
the kernel POV and if you plan to work on a qemu version of it, please
consider reading the work from Hyman Huang from China Telecom on the dirty
limit solution (which is currently based on dirty ring):
https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_qemu-2Ddevel_cover.1648748793.git.huangy81-40chinatelecom.cn_&d=DwIBaQ&c=s883GpUCOChKOHiocYtGcg&r=4hVFP4-J13xyn-OcN0apTCh8iKZRosf5OJTQePXBMB8&m=WHTbPjZer3ai__KbADSbXu_06rsu-MDRK4LCpRgwdXVXtMPlxN2MVMjGzsvBlOqz&s=sMVOOszKIvQ2vM03bdMEhVOAkeN55QgFUk_XbUm2JRI&e=
Since they'll be very similar approaches to solving the same problem.
Hopefully we could unify the work and not have two fully separate
approaches even if they should really share something in common.
Definitely, this is already on my reading list. Thanks.
Thanks,
Thank you for the comments.