Re: concurrent migration of several domains rarely fails

"Lentes, Bernd" <bernd.lentes@xxxxxxxxxxxxxxxxxxxxx> · Thu, 6 Dec 2018 18:12:59 +0100 (CET)

> Hi,
> 
> I have a two-node cluster with several domains as resources. During testing I
> tried several times to migrate some domains concurrently.
> Usually it succeeded, but in rare cases it failed. I found one clue in the log:
> 
> Dec 03 16:03:02 ha-idg-1 libvirtd[3252]: 2018-12-03 15:03:02.758+0000: 3252:
> error : virKeepAliveTimerInternal:143 : internal error: connection closed due
> to keepalive timeout
> 
> The domains are configured similarly:
> primitive vm_geneious VirtualDomain \
>        params config="/mnt/san/share/config.xml" \
>        params hypervisor="qemu:///system" \
>        params migration_transport=ssh \
>        op start interval=0 timeout=120 trace_ra=1 \
>        op stop interval=0 timeout=130 trace_ra=1 \
>        op monitor interval=30 timeout=25 trace_ra=1 \
>        op migrate_from interval=0 timeout=300 trace_ra=1 \
>        op migrate_to interval=0 timeout=300 trace_ra=1 \
>        meta allow-migrate=true target-role=Started is-managed=true \
>        utilization cpu=2 hv_memory=8000
> 
> What is the algorithm to discover the port used for live migration?
> I have the impression that "params migration_transport=ssh" is worthless; port
> 22 isn't involved in live migration.
> My experience is that TCP ports > 49151 are used for the migration, but the
> exact procedure isn't clear to me.
> Does live migration use TCP port 49152 first, and one port higher for each
> following domain?
> E.g. 49152, 49153 and 49154 for the concurrent live migration of three domains.
> 
> Why does live migration of three domains usually succeed, although on both
> hosts just 49152 and 49153 are open?
> Is the migration not really concurrent, but sometimes sequential?
> 
> Bernd
> 
Hi,

I tried to narrow down the problem.
My first assumption was that something was wrong with the network between the hosts.
I opened ports 49152 - 49172 in the firewall - the problem persisted.
So I deactivated the firewall on both nodes - the problem persisted.

Then I wanted to rule out the HA cluster software (pacemaker).
I unmanaged the VirtualDomains in pacemaker and migrated them with virsh - the problem persisted.

I wrote a script to migrate three domains sequentially from host A to host B and vice versa via virsh.
I raised the libvirtd log level and found something in the log which may be the culprit.

This is the output of my script:

Thu Dec  6 17:02:53 CET 2018
migrate sim
Migration: [100 %]
Thu Dec  6 17:03:07 CET 2018
migrate geneious
Migration: [100 %]
Thu Dec  6 17:03:16 CET 2018
migrate mausdb
Migration: [ 99 %]error: operation failed: migration job: unexpectedly failed    <===== error !

Thu Dec  6 17:05:32 CET 2018      <======== time of error
Guests on ha-idg-1: \n
 Id    Name                           State
----------------------------------------------------
 1     sim                            running
 2     geneious                       running
 -     mausdb                         shut off

migrate to ha-idg-2\n
Thu Dec  6 17:05:32 CET 2018
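
For readers who want to reproduce this, here is a minimal sketch of such a sequential migration loop. It is not the original script. The domain names and the target host ha-idg-2 are taken from the output above; the destination URI qemu+ssh://ha-idg-2/system and the helper names build_migrate_cmd and migrate_all are assumptions made for illustration.

```python
import subprocess
from datetime import datetime

# Domain names as they appear in the script output above.
DOMAINS = ["sim", "geneious", "mausdb"]
# Assumption: the destination libvirtd is reached via ssh under this URI.
DEST_URI = "qemu+ssh://ha-idg-2/system"


def build_migrate_cmd(domain, dest_uri):
    """Return the virsh command line for one live migration."""
    return ["virsh", "migrate", "--live", "--verbose", domain, dest_uri]


def migrate_all(domains, dest_uri, run=subprocess.run):
    """Migrate the domains one after another and stop at the first failure.

    Returns the list of domains that migrated successfully. The `run`
    parameter exists only so the loop can be exercised without virsh.
    """
    migrated = []
    for domain in domains:
        print(datetime.now().strftime("%c"))
        print(f"migrate {domain}")
        result = run(build_migrate_cmd(domain, dest_uri))
        if result.returncode != 0:
            print(f"migration of {domain} failed (exit {result.returncode})")
            break
        migrated.append(domain)
    return migrated
```

Calling migrate_all(DOMAINS, DEST_URI) on the source host would run the three migrations in order. Stopping at the first failure keeps the timestamp of the failing migration easy to match against the journal.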

This is what journalctl reported (newest entries first):

Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=0 idle=30
Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout
Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x55b2bb937740

Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=1 idle=25
Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1

Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=2 idle=20
Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1

Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=3 idle=15
Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1

Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=4 idle=10
Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1

Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=5 idle=5
Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
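
Since the entries above are listed newest first, a small parser can put them in chronological order. The following is only a sketch: the regular expression is written against the exact line format shown here, and the name keepalive_countdown is made up for illustration.

```python
import re

# Matches the RPC_KEEPALIVE_TIMEOUT lines shown above and captures the
# libvirt timestamp plus the countToDeath and idle fields.
KEEPALIVE_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\+0000: \d+: info : "
    r"virKeepAliveTimerInternal:\d+ : RPC_KEEPALIVE_TIMEOUT: .*"
    r"countToDeath=(?P<count>\d+) idle=(?P<idle>\d+)"
)


def keepalive_countdown(lines):
    """Return (time, countToDeath, idle) tuples in chronological order."""
    events = []
    for line in lines:
        match = KEEPALIVE_RE.search(line)
        if match:
            events.append(
                (match["time"], int(match["count"]), int(match["idle"]))
            )
    # Sorting by the time string is sufficient within a single day.
    return sorted(events)


# Three lines copied from the journal excerpt above.
SAMPLE = [
    "Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=0 idle=30",
    "Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout",
    "Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=5 idle=5",
]

for event in keepalive_countdown(SAMPLE):
    print(event)
```

The error line is skipped because it is not an RPC_KEEPALIVE_TIMEOUT entry. The two remaining tuples show the first and last step of the countdown.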

There seems to be a kind of countdown: countToDeath drops from 5 to 0 while idle rises in 5-second steps to 30. From googling I found that this may be related to these settings in libvirtd.conf:

# Keepalive settings for the admin interface
#admin_keepalive_interval = 5
#admin_keepalive_count = 5
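
The numbers in the log fit an interval of 5 seconds and a count of 5: a connection that never answers is closed after interval * (count + 1) = 30 seconds of idle time. The sketch below simulates that countdown. It assumes the peer never responds. Whether the values in effect really come from the admin_* settings or from another keepalive setting is not established here, and the function names are made up for illustration.

```python
def keepalive_timeline(interval, count):
    """Simulate an unanswered keepalive countdown.

    Yields (idle_seconds, count_to_death, action) once per timer tick,
    assuming the peer never responds to any keepalive request.
    """
    for tick in range(1, count + 2):
        remaining = count - (tick - 1)
        action = "send keepalive" if remaining > 0 else "close connection"
        yield tick * interval, remaining, action


def seconds_until_close(interval, count):
    """Idle time after which an unresponsive connection is closed."""
    return interval * (count + 1)


for idle, count_to_death, action in keepalive_timeline(5, 5):
    print(f"idle={idle:2d} countToDeath={count_to_death} -> {action}")
```

With interval 5 and count 5 this reproduces the sequence from the journal: five keepalive requests at idle=5 to idle=25, then the connection is closed at idle=30.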

What is meant by the "admin interface"? virsh?
What is meant by "client" in libvirtd.conf? virsh?
Why do I get regular timeouts although my two hosts are powerful? Each has 128 GB RAM, 16 cores and two bonded 1 Gbit/s network adapters.
During migration I don't see much load and almost no waiting for I/O.

Should I set admin_keepalive_interval to -1?

Bernd

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de

_______________________________________________
libvirt-users mailing list
libvirt-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvirt-users