On Wed, Mar 06, 2024 at 05:17:36PM +0100, Christian Rohmann via Users wrote:
Hello libvirt-users!
Hi, I'll try to reply in the simplest possible way.
We observe lock-ups / timeouts in prometheus-libvirt-exporter (https://github.com/inovex/prometheus-libvirt-exporter) when libvirt is live-migrating domains:

Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)

All of the source code can be found at: https://github.com/inovex/prometheus-libvirt-exporter/blob/master/pkg/exporter/prometheus-libvirt-exporter.go. Basically, the error happens when DomainMemoryStats or other operational domain info is queried via the libvirt socket.
Yes, the domain is being modified by the migration, so it is locked.
1) We are actually using the read-only socket at '/var/run/libvirt/libvirt-sock-ro', so there should not be any locking required.
On the contrary, even for reading you need a read lock if someone is writing.
Is there any way to not run into lock contention, like running a request with some "nolock" indication?
You can use the VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT flag (for virConnectGetAllDomainStats), which should skip any stats that are unavailable because the domain has a job running and libvirt can't grab a new job.
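As an illustration of the NOWAIT approach, here is a minimal sketch using the libvirt Python bindings (the exporter in question is written in Go, so this is only a stand-in). The helper name collect_domain_stats, the connection URI and the choice of stats groups are assumptions made for the example; getAllDomainStats() and the VIR_* constants come from the bindings.

```python
def collect_domain_stats(conn, stats, flags):
    """Return {domain name: stats dict} from one bulk getAllDomainStats() call.

    conn is an open libvirt connection; stats and flags are the usual
    VIR_DOMAIN_STATS_* and VIR_CONNECT_GET_ALL_DOMAINS_STATS_* bitmasks.
    """
    records = conn.getAllDomainStats(stats, flags)
    return {dom.name(): params for dom, params in records}


def main():
    # Imported here so the helper above can be exercised without libvirt installed.
    import libvirt

    conn = libvirt.openReadOnly("qemu:///system")  # assumed URI
    try:
        stats = libvirt.VIR_DOMAIN_STATS_STATE | libvirt.VIR_DOMAIN_STATS_BALLOON
        flags = libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT
        for name, params in collect_domain_stats(conn, stats, flags).items():
            # With NOWAIT, a domain that is busy with a job may lack some keys,
            # so read every field defensively.
            print(name, params.get("balloon.current"))
    finally:
        conn.close()
```

Calling main() on a host with libvirt-python installed prints one line per domain. Domains in the middle of a migration should still show up, just with fewer fields, so the consumer has to tolerate missing keys.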
2) This being reported as timeout waiting for the lock, what is the timeout and would waiting a bit longer help? Or is the lock active during the whole time a domain live migration is running?
Yes, for the most part the lock is held for the whole time the migration is running.
3) Is this in any way related to the type of migration? Tunneled vs. native (https://libvirt.org/migration.html)?
Not really.
4) Is there any indication that we could use to skip those domains (or certain queries)?
Well, you could decide that based on the error returned, but it's better to skip the unavailable stats as described above than to wait for the error. One might think of checking whether a job is running on the domain and skipping such domains, but that's an obvious race condition, and you would also get no stats at all while any other kind of job is running.
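For completeness, the error-based fallback mentioned above could look roughly like the following sketch with the libvirt Python bindings. The helper name memory_stats_or_none is hypothetical, and it is an assumption that the "Timed out during operation" message maps to VIR_ERR_OPERATION_TIMEOUT; the exception class and error code are passed in so that the function itself does not depend on libvirt being installed.

```python
def memory_stats_or_none(dom, error_cls, timeout_code):
    """Return dom.memoryStats(), or None if libvirt reports an operation timeout.

    error_cls is expected to be libvirt.libvirtError and timeout_code
    libvirt.VIR_ERR_OPERATION_TIMEOUT (assumed to be the code behind
    "Timed out during operation").
    """
    try:
        return dom.memoryStats()
    except error_cls as exc:
        if exc.get_error_code() == timeout_code:
            # The domain is busy with a job (e.g. a migration): skip it for now.
            return None
        raise  # any other error is a real failure and should surface
```

A caller would use it as memory_stats_or_none(dom, libvirt.libvirtError, libvirt.VIR_ERR_OPERATION_TIMEOUT). Note that this still blocks until the lock wait times out, which is why the NOWAIT flag is the better option.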
The same issue was actually previously reported for another implementation of a Prometheus exporter (https://github.com/kumina/libvirt_exporter/issues/33). Currently the exporter locks up or throws the mentioned timeout errors during the migration of 200 domains, 5 at a time. It would be awesome to find a way to make this work as smoothly as possible, even during live migrations! I am thankful for any insights into how the libvirt socket, the various calls, the locking mechanisms or live migration modes work!

Regards
Christian
_______________________________________________
Users mailing list -- users@xxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxx