Re: Error Cannot acquire state change lock from remoteDispatchDomainMigratePrepare3Params during live migration of domains

Sorry, this mail got buried a bit on my side.

On Tue, Apr 02, 2024 at 01:23:13PM +0200, Christian Rohmann wrote:
Hello Daniel, Michael, Martin, all,


first of all, thank you very much for your time and input on this matter!
We truly strive to make the Prometheus exporter a solid tool in the
monitoring toolbox.



On 07.03.24 10:51 AM, Martin Kletzander wrote:

Is there any way to not run into lock contention, like running a request
with some "nolock" indication?


You can use flag VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT which should
skip getting any unavailable stats if the domain has a job running and
libvirt can't grab a new job.

This flag is only available for "virConnectGetAllDomainStats", but we
also use e.g. "virDomainMemoryStats", "virDomainInterfaceStats" or
"virDomainBlockStats".

Could we somehow switch to using only "virConnectGetAllDomainStats" and
enable all stats to be returned? It seems, though, that the more detailed
memory stats are only returned by "virDomainMemoryStats".


Do you know off the top of your head which stats are returned by
virDomainMemoryStats but missing from AllDomainStats?

Maybe consolidating the code paths could be one solution.
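
As a rough sketch of what that consolidation could look like (assuming the
libvirt-python binding and its VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT
constant; the connection and flag are injected here so the helper can be
exercised without a live daemon):

```python
def scrape_domain_stats(conn, nowait_flag):
    """Return {domain name: stats dict} for all domains in one bulk call.

    `conn` is a virConnect-like object and `nowait_flag` would be
    libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT with the real
    binding; both are passed in so the helper can be tested offline.
    With NOWAIT set, libvirt skips the stats groups it cannot collect
    for a domain whose job lock is held (e.g. during migration), so
    the scrape returns partial data instead of timing out.
    """
    results = {}
    # getAllDomainStats returns a list of (virDomain, dict) pairs.
    for dom, stats in conn.getAllDomainStats(flags=nowait_flag):
        results[dom.name()] = stats
    return results
```

(If memory serves, "virsh domstats --nowait" exercises the same flag from
the command line, which might help verify the behaviour during a migration.)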


On 07.03.24 4:20 PM, Michal Prívozník wrote:
Yes, the domain is being modified by the migration, so it is locked.
While this is true, the "lock" (or rather, the job) is an async one,
meaning a QUERY job can still be acquired. It's only a MODIFY job that
should wait in the queue.

What's rather weird is - the thread holding the job is 'MigratePrepare'
which usually isn't that long.

Let me ask again if this could be related to the type of migration
(tunneled vs. native - https://libvirt.org/migration.html).


This is my bad; it does not matter. The job was _created_ by
MigratePrepare, but it is probably in the Perform phase for almost
all of that time.

And during that phase it is not only impossible to gather a lot of the
data, it also does not make sense to fetch it.

We also see error messages logged by libvirtd itself ....

--- cut ---
Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-00020100; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 39s)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-00020100; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 39s)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-00020100; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 49s)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-00020100; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 49s)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-0001f8f7; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 33s)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-0001f8f7; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 33s)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-0001f8f7; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 43s)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-0001f8f7; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 44s)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-0001f8f7; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 53s)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-0001f8f7; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 54s)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-0001f8f7; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-0001f8f7; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-0001f8f7; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none,
none) for domain instance-0001f8f7; current job is (none, none,
migration in) owned by (0 <null>, 0 <null>, 0
remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 64s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation:
cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
--- cut ---

Unfortunately there is no mention of which client or call these originate from.


Well, you could check the PID and with more debug logs figure out who is
calling the API that fails.
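
For reference, one way to get those richer debug logs is via libvirtd's
log filters; the exact filter set below is just a suggestion:

```
# /etc/libvirt/libvirtd.conf -- raise logging for the RPC and QEMU
# driver layers so the failing API calls and their client connections
# show up; restart libvirtd afterwards.
log_filters="1:qemu 1:rpc"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
```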

@Christian, what is the libvirt version? Are you able to reproduce with
either libvirt-10.1.0 or (even better) current master?

We are using 8.0.0-1ubuntu7.8 via Ubuntu 22.04 packages. Unfortunately
we cannot simply upgrade to 10.x.
Do you expect any of the changes between 8 and 10 in particular to make
a difference here?



On 07.03.24 4:30 PM, Daniel P. Berrangé wrote:
With live migration making requests across multiple libvirt daemons,
if the target host has filled its 5-request queue with long-running
operations, and then a "prepare migrate" call comes in, that'll get
stalled behind a possibly slow operation at the RPC dispatch level.

I'd suggest bumping 'max_client_requests' to 100 and seeing if the
problem goes away.

We currently run with the default value of "5" and shall try raising it.
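
For anyone following along, that knob lives in the daemon config
(the default is 5):

```
# /etc/libvirt/libvirtd.conf -- allow more concurrent RPC requests per
# client connection, so slow migration calls are less likely to starve
# quick query calls; restart libvirtd after changing.
max_client_requests = 100
```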


Have you tried that?  Did it make a difference?

Please also see the error messages above. We unfortunately cannot easily
determine which clients receive this error or which calls lead to it.
But we do know that the "migration in" job seems to be holding these locks.

Our clients should only be ...

* libvirt itself (coordinating migrations)
* OpenStack Nova "nova-compute"
* libvirt-exporter

Could it be that, because the communication happens via a unix socket,
there is so little context here?

Most probably not.

All those "none" and "null" values in the error message.


Those are various fields of the job which cannot all be set; it's
just an internal representation of the jobs. That is fine the way it is.



Regards


Christian


_______________________________________________
Users mailing list -- users@xxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxx
