Re: How to monitor domains in regards steal time and other important metrics (VIR_DOMAIN_STATS_VCPU) ?

Guy Godfroy <guy.godfroy@xxxxxxxx> · Fri, 19 Jan 2024 13:05:30 +0100

Hi Christian,

I can't answer your question, which is too technical for my humble 
knowledge, but I wanted to seize the opportunity to thank you for your 
effort in maintaining a Prometheus exporter for libvirt.

I also wanted to talk a bit about the features of your exporter. Maybe 
this discussion should be held elsewhere; let me know.

You proposed that the prometheus-community adopt your exporter, which is 
a super cool idea. But IMHO, before that, you should expose or plan to 
expose a few more metrics.

The metric list in the README contains only domain-related metrics. 
Other exporters, like tinkoff (the one I'm using), expose a bit more; I 
know there are metrics about pools at least. Do you plan to include more 
metrics in the future (volumes, volume pools, networks...)? I can 
understand if you need only domain-related metrics, but I think the other 
metrics should be there if this becomes a kind of official exporter for 
libvirt.

Thanks again.

Guy Godfroy

On 19/01/2024 at 12:35, Christian Rohmann via Users wrote:
With the holidays and all, I take the liberty of bumping this post.
Anybody got any idea on how to monitor steal time then?

On 21.12.23 17:36, Christian Rohmann wrote:
Hey libvirt-users,

First, allow me to give a little background.

We monitor performance metrics of OpenStack Nova VMs using libvirt as 
hypervisor. We used to run the libvirt prometheus exporter written by 
zhangjianweibj [1].
This exporter, compared to the one from kumina / tinkoff [2], makes 
use of the DigitalOcean go-libvirt [3], but that should not make much 
of a difference for my questions.
Since the development of that exporter seems to have stalled and we 
wanted to rework and contribute new features to it, we created a fork 
[4].
After working through the various ideas we had and applying them to 
the code, we proposed that the prometheus-community adopt the exporter 
[5] to ensure it is maintained and even to serve as a reference exporter.

Now to my actual question ...

Libvirt exposes per-vCPU stats for domains via [6]. I'd like to be 
able to export those via the exporter.
One important metric to me would be the steal time 
(vcpu.<num>.delay), to determine if domains are starting to get cut 
short or even starve for CPU time. Apparently those metrics are not, or 
cannot be, exposed anymore since the switch to cgroups v2? Reading [7] 
or [8], others seem to have run into this.
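
To make the shape of these stats concrete, here is a minimal Python sketch 
that groups flat vcpu.<num>.<field> keys by vCPU number. The sample text 
only mimics the style of `virsh domstats --vcpu` output; the domain name 
and all values are made up, and group_vcpu_stats is a hypothetical helper, 
not part of libvirt or any exporter.

```python
import re
from collections import defaultdict

# Matches flat per-vCPU keys such as "vcpu.0.delay=12345".
_VCPU_KEY = re.compile(r"^vcpu\.(\d+)\.(\w+)=(\d+)$")


def group_vcpu_stats(text):
    """Group 'vcpu.<num>.<field>=<value>' lines by vCPU number.

    Hypothetical helper for illustration. Lines that do not match
    (e.g. 'vcpu.current=2' or the 'Domain:' header) are ignored.
    """
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = _VCPU_KEY.match(line.strip())
        if match:
            num, field, value = match.groups()
            stats[int(num)][field] = int(value)
    return dict(stats)


# Made-up sample in the style of `virsh domstats --vcpu` output.
SAMPLE = """\
Domain: 'instance-0001'
  vcpu.current=2
  vcpu.maximum=2
  vcpu.0.state=1
  vcpu.0.time=1000000000
  vcpu.0.delay=50000000
  vcpu.1.state=1
  vcpu.1.time=2000000000
  vcpu.1.delay=0
"""

if __name__ == "__main__":
    for num, fields in sorted(group_vcpu_stats(SAMPLE).items()):
        # A missing 'delay' key would show up as None here.
        print(num, fields.get("delay"))
```

If the delay field is absent on a given host, fields.get("delay") returns 
None, which an exporter could use to skip the metric rather than report 
zero.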

Is this actually still the case, even for more recent kernels? If so, 
I am wondering if there is an issue being tracked to implement this 
functionality?
How is the steal time reported to the guest if the hypervisor is 
unable to export this info?

Then there are other approaches like vmtop by DigitalOcean [9], 
which uses info and metrics available via /proc to determine 
steal time and other vCPU-based metrics.
So it seems the required data is at least partly available from the kernel?
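
As a rough illustration of the /proc approach: on Linux, each thread has a 
/proc/<pid>/task/<tid>/schedstat file with three counters (time on CPU in 
ns, time waiting on a run queue in ns, number of timeslices). The sketch 
below samples such counters and computes the share of time spent waiting. 
Whether vmtop computes steal time exactly this way is an assumption, and 
mapping thread IDs to vCPUs is left out; all function names are 
hypothetical.

```python
from typing import NamedTuple


class SchedStat(NamedTuple):
    run_ns: int      # time spent running on a CPU
    wait_ns: int     # time spent runnable but waiting on a run queue
    timeslices: int  # number of timeslices run on a CPU


def parse_schedstat(line):
    """Parse the three whitespace-separated schedstat counters."""
    run_ns, wait_ns, timeslices = (int(x) for x in line.split()[:3])
    return SchedStat(run_ns, wait_ns, timeslices)


def read_schedstat(pid, tid):
    """Read the counters for one thread (e.g. a QEMU vCPU thread)."""
    with open(f"/proc/{pid}/task/{tid}/schedstat") as fh:
        return parse_schedstat(fh.read())


def wait_ratio(before, after):
    """Fraction of (run + wait) time spent waiting between two samples."""
    run = after.run_ns - before.run_ns
    wait = after.wait_ns - before.wait_ns
    total = run + wait
    return wait / total if total > 0 else 0.0
```

Taking two samples a few seconds apart with read_schedstat and passing 
them to wait_ratio gives a per-thread estimate that could be compared 
against the steal time seen inside the guest.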

Last but not least, I'd like your opinion on what other key metrics 
are important to monitor on hypervisors and their guests.

Regards

Christian

[1] https://github.com/zhangjianweibj/prometheus-libvirt-exporter
[2] https://github.com/Tinkoff/libvirt-exporter
[3] https://github.com/digitalocean/go-libvirt
[4] https://github.com/inovex/prometheus-libvirt-exporter
[5] https://github.com/prometheus-community/community/issues/50
[6] https://libvirt.org/html/libvirt-libvirt-domain.html#VIR_DOMAIN_STATS_VCPU
[7] https://bugzilla.redhat.com/show_bug.cgi?id=2015763
[8] https://bugzilla.redhat.com/show_bug.cgi?id=1796543
[9] https://github.com/digitalocean/vmtop/

_______________________________________________
Users mailing list -- users@xxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxx