I'm certainly totally opposed to stats that don't have a clear use
case. People keep piling random stats that satisfy their pet usage,
and this only bloats the various structures for no overall benefit
other than "hey, it might be useful". This is death by a thousand cut.
And while we're talking about pages_dirtied, I really dislike the
WARN_ON in mark_page_dirty_in_slot(). A counter has rolled over?
Shock, horror...
Ack. I'll give it a thought but if you have any specific suggestion on
how I can make it better, kindly let me know. Thanks.
What is the effect of counter overflowing? Why is it important to
warn? What goes wrong? What could be changed to *avoid* this being an
issue?
M.
When dirty quota is not enabled, counter overflow has no harm as such.
If dirty logging is enabled with dirty quota, two cases may arise:
i) While setting the dirty quota to count + new quota, the dirty quota
itself overflows. Now, if the userspace doesn’t manipulate the count
accordingly: the count will still be a large value and so the vcpu will
exit to userspace again and again until the count also overflows at some
point (this is inevitable because the count will continue getting
incremented after each write).
ii) Dirty quota is very close to the max value of a 64 bit unsigned
int. Dirty count can overflow in this case and the vcpu might never exit
to userspace (with exit reason - dirty quota exhausted) which means no
throttling happens. One possible way to resolve this is by exiting to
userspace as soon as the count equals the dirty quota. By not waiting
for the dirty count to exceed the dirty quota, we can avoid this.
Though, this is difficult to achieve due to Intel’s PML.
In both these cases, nothing catastrophic happens; it’s just that the
userspace’s expectations are not met. However, we can have clear
instructions in the documentation on how the userspace can avoid these
issues altogether by resetting the count and quota values whenever they
exceed a safe level (for now I am able to think of 512 less than the
maximum value of an unsigned int as a safe value for dirty quota). To
allow the userspace to do it, we need to provide the userspace some way
to reset the count. I am not sure how we can achieve this if the dirty
count (i.e. pages dirtied) is a KVM stat. But, if we can make it a
member of kvm_run, it is fairly simple to do. So IMO, yes, warn is
useless here.
Happy to know your thoughts on this. Really grateful for the help so far.
Thanks,
Shivam