Sorry, this mail got buried a bit on my side.

On Tue, Apr 02, 2024 at 01:23:13PM +0200, Christian Rohmann wrote:
> Hello Daniel, Michael, Martin, all,
>
> first of all, thank you very much for your time and input on this matter! We truly strive to improve the Prometheus exporter to be a solid tool in the monitoring box.
>
> On 07.03.24 10:51 AM, Martin Kletzander wrote:
> > > Is there any way to not run into lock contention, like running a request with some "nolock" indication?
> >
> > You can use flag VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT which should skip getting any unavailable stats if the domain has a job running and libvirt can't grab a new job.
>
> This flag is only available for "virConnectGetAllDomainStats", but we also use e.g. "virDomainMemoryStats", "virDomainInterfaceStats" or "virDomainBlockStats". Could we somehow switch to only "virDomainBlockStats" and by enabling all stats to be returned? It seems though, that more detailed memory stats are only returned by "virDomainMemoryStats".
Do you know off the top of your head which stats are returned by virDomainMemoryStats but missing from AllDomainStats? Maybe consolidating the code paths could be one solution.
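To make the consolidation idea a bit more concrete, here is a minimal sketch using the libvirt-python bindings. It is only an illustration under assumptions: the helper name `collect_domain_stats`, the `qemu:///system` URI and the chosen stats groups are mine, and the exporter itself does not have to look like this. The point is that a single bulk call carries the NOWAIT flag for all requested groups.

```python
def collect_domain_stats(conn, stats_mask, flags):
    """Fetch stats for all domains with one bulk call.

    `conn` is expected to behave like libvirt.virConnect, whose
    getAllDomainStats() returns a list of (domain, stats_dict) tuples.
    """
    results = {}
    for dom, stats in conn.getAllDomainStats(stats_mask, flags):
        # With NOWAIT a domain that has a job running may come back with
        # only part of its stats, so callers should treat keys as optional.
        results[dom.name()] = dict(stats)
    return results


def main():
    # Imported lazily so the helper above can be tried without libvirt.
    import libvirt

    conn = libvirt.openReadOnly("qemu:///system")  # URI is an assumption
    mask = (libvirt.VIR_DOMAIN_STATS_BALLOON
            | libvirt.VIR_DOMAIN_STATS_INTERFACE
            | libvirt.VIR_DOMAIN_STATS_BLOCK)
    flags = libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT
    try:
        for name, stats in collect_domain_stats(conn, mask, flags).items():
            # .get() because the key can be absent for a busy domain.
            print(name, stats.get("balloon.current"))
    finally:
        conn.close()
```

Calling `main()` on a hypervisor host prints one line per domain. Whether the balloon group covers everything the exporter currently reads via virDomainMemoryStats is exactly the open question above.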
> On 07.03.24 4:20 PM, Michal Prívozník wrote:
> > > Yes, the domain is being modified by the migration, so it is locked.
> >
> > While this is true, the "lock" - or job I should rather say is an async one, meaning a QUERY job can be acquired. It's only MODIFY job that should wait in the queue. What's rather weird is - the thread holding the job is 'MigratePrepare' which usually isn't that long.
>
> Let me ask again if this could be related to the type of migration (Tunneled vs. native - https://libvirt.org/migration.html).
My bad: the migration type does not matter. The job was _created_ by MigratePrepare, but it is probably in the Perform phase almost all of the time. During that phase it is not only impossible to gather a lot of the data, it also does not make sense to fetch it.
> We also see error messages logged by libvirtd itself ....
>
> --- cut ---
> Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 39s)
> Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 39s)
> Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 49s)
> Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 49s)
> Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 33s)
> Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 33s)
> Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 43s)
> Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 44s)
> Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 53s)
> Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 54s)
> Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
> Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
> Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
> Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 64s)
> Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
> --- cut ---
>
> unfortunately there is no mention which client or call these originate from.
Well, you could check the PID and with more debug logs figure out who is calling the API that fails.
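Until such debug logs are available, the existing messages can at least be grouped per domain. The following is a rough sketch, assuming the message format stays exactly as in the excerpt above; the regular expression and the function names are my own and nothing libvirt provides. It does not reveal the calling client, it only extracts what the "Cannot start job" lines already contain.

```python
import re

# Matches the "Cannot start job" message as it appears in the log excerpt.
JOB_RE = re.compile(
    r"Cannot start job \((?P<requested>[^)]*)\) for domain (?P<domain>\S+); "
    r"current job is \((?P<current>[^)]*)\) "
    r"owned by \(.*?(?P<owner>\w+) \(flags=(?P<flags>0x[0-9a-fA-F]+)\)\) "
    r"for \((?P<ages>[^)]*)\)"
)


def _fields(text):
    """Split a comma separated group like 'none, none, migration in'."""
    return tuple(part.strip() for part in text.split(","))


def parse_job_error(line):
    """Return the parsed fields of one log line, or None if it does not match."""
    m = JOB_RE.search(line)
    if m is None:
        return None
    return {
        "domain": m.group("domain"),
        "requested": _fields(m.group("requested")),
        "current": _fields(m.group("current")),
        "owner_api": m.group("owner"),
        "ages_seconds": tuple(int(a.rstrip("s")) for a in _fields(m.group("ages"))),
    }


def longest_wait_per_domain(lines):
    """Map each domain to the largest job age (in seconds) seen in the log."""
    worst = {}
    for line in lines:
        info = parse_job_error(line)
        if info is None:
            continue  # e.g. the "Timed out during operation" lines
        age = max(info["ages_seconds"])
        worst[info["domain"]] = max(age, worst.get(info["domain"], 0))
    return worst
```

Feeding the journal lines of one host into `longest_wait_per_domain()` shows how long each incoming migration had been holding its job when a query was refused.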
> > @Christian, what is the libvirt version? Are you able to reproduce with either libvirt-10.1.0 or (even better) current master?
>
> We are using 8.0.0-1ubuntu7.8 via Ubuntu 22.04 packages. Unfortunately we cannot simply upgrade to 10.x. Do you expect any of the changes between 8 and 10 in particular to make a difference here?
>
> On 07.03.24 4:30 PM, Daniel P. Berrangé wrote:
> > With live migration making requests across multiple libvirt daemons, if the target host has filled its 5 requests queue with long running operations, and then a "prepare migrate" call comes in, that'll get stalled behind a possibly slow operation at the RPC dispatch level. I'd suggest bumping 'max_client_requests' to 100 and seeing if the problem goes away.
>
> We currently run with the default value of "5" and shall try and raise it some.
Have you tried that? Did it make a difference?
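When testing this it helps to confirm which value each host really has configured. Below is a small sketch that reads the setting from the text of the daemon config; the function name is made up, the parsing is deliberately simplistic, and the fallback of 5 is simply the default mentioned in this thread.

```python
import re

DEFAULT_MAX_CLIENT_REQUESTS = 5  # default value as mentioned in this thread


def max_client_requests(conf_text):
    """Return the configured max_client_requests, or the default if unset."""
    value = DEFAULT_MAX_CLIENT_REQUESTS
    for line in conf_text.splitlines():
        # Drop comments; the shipped config keeps the option commented out.
        line = line.split("#", 1)[0].strip()
        m = re.fullmatch(r"max_client_requests\s*=\s*(\d+)", line)
        if m:
            value = int(m.group(1))  # last assignment in the file is used here
    return value
```

On a typical installation the text would come from /etc/libvirt/libvirtd.conf (the path is an assumption and may differ per distribution or with the modular daemons), for example `max_client_requests(open("/etc/libvirt/libvirtd.conf").read())`.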
> Please also see the error messages above. We unfortunately cannot easily determine which clients receive this error or which calls lead to them. But we do know that the "migration in" seems to be holding these locks. Our clients should only be ...
>
>  * libvirt itself (coordinating migrations)
>  * OpenStack Nova "nova-compute"
>  * libvirt-exporter
>
> Could it be that due to the communication happening via unix socket that there is so little context here?
Most probably not.
> All those "none" and "null" values in the error message.
Those are the various fields of the job, which cannot all be set at once; it's just an internal representation of the jobs. That should be fine the way it is.
> Regards
> Christian
_______________________________________________ Users mailing list -- users@xxxxxxxxxxxxxxxxx To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxx