First of all, thank you very much for your time and input on this matter!
We truly strive to improve the Prometheus exporter to be a solid tool in the monitoring box.
Is there any way to not run into lock contention, like running a request
with some "nolock" indication?
You can use flag VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT which should
skip getting any unavailable stats if the domain has a job running and
libvirt can't grab a new job.
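For illustration, here is a minimal sketch of one bulk-stats call with that flag. It assumes the libvirt-python bindings are installed; the exporter itself may well use different bindings. The helper names `collect_domain_stats` and `scrape`, the connection URI and the chosen stats groups are assumptions made for this example only.

```python
def collect_domain_stats(conn, stats_mask, flags):
    """Return {domain name: stats dict} from one virConnectGetAllDomainStats call."""
    # getAllDomainStats() yields (virDomain, dict) pairs, one per domain.
    return {dom.name(): record
            for dom, record in conn.getAllDomainStats(stats_mask, flags)}


def scrape(uri="qemu:///system"):
    """Open a connection and fetch stats without waiting on busy domains."""
    import libvirt  # libvirt-python; imported lazily so the helper above has no hard dependency

    # Stats groups picked for illustration; adjust to what the exporter needs.
    stats_mask = (libvirt.VIR_DOMAIN_STATS_STATE
                  | libvirt.VIR_DOMAIN_STATS_BALLOON
                  | libvirt.VIR_DOMAIN_STATS_INTERFACE
                  | libvirt.VIR_DOMAIN_STATS_BLOCK)
    # With NOWAIT, stats that would require a job on a busy domain are
    # skipped instead of waited for, so some keys may be missing.
    flags = libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT

    conn = libvirt.open(uri)
    try:
        return collect_domain_stats(conn, stats_mask, flags)
    finally:
        conn.close()
```

Since stats can be skipped for a domain that has a job running, callers should tolerate missing keys, e.g. `record.get("balloon.current")` rather than `record["balloon.current"]`.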
This flag is only available for "virConnectGetAllDomainStats",
but we also use e.g. "virDomainMemoryStats", "virDomainInterfaceStats" or
"virDomainBlockStats".
Could we somehow switch to using only "virDomainBlockStats" and enable all
stats to be returned? It seems, though, that more detailed memory stats are
only returned by "virDomainMemoryStats".
Yes, the domain is being modified by the migration, so it is locked.
While this is true, the "lock" (or job, I should rather say) is an async one, meaning a QUERY job can be acquired. It's only a MODIFY job that should wait in the queue. What's rather weird is that the thread holding the job is 'MigratePrepare', which usually doesn't take that long.
Let me ask again if this could be related to the type of migration
(tunneled vs. native - https://libvirt.org/migration.html).
We also see error messages logged by libvirtd itself ....
--- cut ---
Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-00020100; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 39s)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-00020100; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 39s)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-00020100; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 49s)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-00020100; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 49s)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-0001f8f7; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 33s)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-0001f8f7; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 33s)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-0001f8f7; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 43s)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-0001f8f7; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 44s)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-0001f8f7; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 53s)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-0001f8f7; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 54s)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-0001f8f7; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-0001f8f7; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-0001f8f7; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query,
none, none) for domain instance-0001f8f7; current job is (none,
none, migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s,
0s, 64s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during
operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
--- cut ---
Unfortunately, there is no mention of which client or call these
originate from.
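To make the excerpt above easier to digest, here is a rough sketch that condenses such messages per domain and decodes the flags value. It rests on two assumptions: that the last value in the "for (...)" tuple is the age of the async job, and that "flags=0x1b" is the migration flags bitmask passed to the prepare call. The names `summarize` and `decode_migrate_flags` are made up for this example, and only the five lowest migration flag bits are listed.

```python
import re

# Matches one "Cannot start job" message after the mail wrapping is undone.
JOB_RE = re.compile(
    r"Cannot start job \([^)]*\) for domain (?P<domain>\S+); "
    r"current job is \([^)]*\) "
    r"owned by \(.*?(?P<api>\w+) \(flags=(?P<flags>0x[0-9a-fA-F]+)\)\) "
    r"for \((?P<ages>[^)]*)\)"
)

# Lowest five bits of virDomainMigrateFlags (not the complete list).
MIGRATE_FLAGS = {
    0x01: "VIR_MIGRATE_LIVE",
    0x02: "VIR_MIGRATE_PEER2PEER",
    0x04: "VIR_MIGRATE_TUNNELLED",
    0x08: "VIR_MIGRATE_PERSIST_DEST",
    0x10: "VIR_MIGRATE_UNDEFINE_SOURCE",
}


def decode_migrate_flags(value):
    """Return the names of the known flag bits set in value."""
    return [name for bit, name in MIGRATE_FLAGS.items() if value & bit]


def summarize(log_text):
    """Return {domain: {"api", "flags", "max_age_s"}} for blocked query jobs."""
    flat = " ".join(log_text.split())  # undo hard line wrapping
    summary = {}
    for m in JOB_RE.finditer(flat):
        # Assumed: the third duration is the age of the async (migration) job.
        age = int(m["ages"].split(",")[-1].strip().rstrip("s"))
        entry = summary.setdefault(
            m["domain"],
            {"api": m["api"], "flags": int(m["flags"], 16), "max_age_s": 0},
        )
        entry["max_age_s"] = max(entry["max_age_s"], age)
    return summary
```

If the flags value really is the migration flags bitmask, 0x1b decodes to LIVE, PEER2PEER, PERSIST_DEST and UNDEFINE_SOURCE, i.e. without VIR_MIGRATE_TUNNELLED.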
@Christian, what is the libvirt version? Are you able to reproduce with either libvirt-10.1.0 or (even better) current master?
We are using 8.0.0-1ubuntu7.8 via Ubuntu 22.04 packages. Unfortunately we cannot simply upgrade to 10.x.
Do you expect any of the changes between 8 and 10 in particular to make a difference here?
With live migration making requests across multiple libvirt daemons, if the target host has filled its 5-request queue with long-running operations and then a "prepare migrate" call comes in, that call will get stalled behind a possibly slow operation at the RPC dispatch level. I'd suggest bumping 'max_client_requests' to 100 and seeing if the problem goes away.
We currently run with the default value of "5" and will try
raising it.
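As a quick way to check what a host is actually running with, a small sketch like the following could read the setting from the daemon configuration. The function name `effective_max_client_requests` is made up for this example, and it assumes the setting lives in a libvirtd.conf-style file with simple `key = value` lines.

```python
import re

DEFAULT_MAX_CLIENT_REQUESTS = 5  # default value as mentioned in this thread


def effective_max_client_requests(conf_text):
    """Return max_client_requests from libvirtd.conf text, or the default."""
    value = DEFAULT_MAX_CLIENT_REQUESTS
    for line in conf_text.splitlines():
        # Commented-out lines start with '#' and therefore do not match.
        m = re.match(r"\s*max_client_requests\s*=\s*(\d+)", line)
        if m:
            value = int(m.group(1))  # the last uncommented setting is used
    return value
```

It could be called with the contents of the daemon's config file, e.g. `effective_max_client_requests(open("/etc/libvirt/libvirtd.conf").read())`, assuming that is the path used on the host. A changed value presumably only takes effect after the daemon has been restarted.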
Please also see the error messages above. Unfortunately, we cannot
easily determine which clients receive this error or which calls lead
to it. But we do know that the "migration in" job seems to be holding
these locks.
Our clients should only be ...
* libvirt itself (coordinating migrations)
* OpenStack Nova "nova-compute"
* libvirt-exporter
Could it be that there is so little context here because the
communication happens via a Unix socket? See all those "none" and
"null" values in the error message.
Regards
Christian