The issue is reproducible in svl-3 with rbd cache set to false. On the 5th ping-pong (back-and-forth) live migration, the instance started dropping pings and did not recover for 20+ minutes:

(os-clients)[root@fedora21 nimbus-env]# nova live-migration lmtest1
(os-clients)[root@fedora21 nimbus-env]# nova show lmtest1 |grep -E 'hypervisor_hostname|task_state|vm_state'
| OS-EXT-SRV-ATTR:hypervisor_hostname | svl-3-cc-nova1-002.cisco.com |
| OS-EXT-STS:task_state               | migrating                    |
| OS-EXT-STS:vm_state                 | active                       |
(os-clients)[root@fedora21 nimbus-env]# nova show lmtest1 |grep -E 'hypervisor_hostname|task_state|vm_state'
| OS-EXT-SRV-ATTR:hypervisor_hostname | svl-3-cc-nova1-001.cisco.com |
| OS-EXT-STS:task_state               | -                            |
| OS-EXT-STS:vm_state                 | active                       |
(os-clients)[root@fedora21 nimbus-env]# ping -c3 -S60 10.33.143.215
PING 10.33.143.215 (10.33.143.215) 56(84) bytes of data.

--- 10.33.143.215 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2001ms

(os-clients)[root@fedora21 nimbus-env]# ping -c3 -S60 10.33.143.215
PING 10.33.143.215 (10.33.143.215) 56(84) bytes of data.

--- 10.33.143.215 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms

(os-clients)[root@fedora21 nimbus-env]# ping -c3 -S60 10.33.143.215
PING 10.33.143.215 (10.33.143.215) 56(84) bytes of data.

--- 10.33.143.215 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms

Yuming

On 4/10/15, 4:51 PM, "Josh Durgin" <jdurgin@xxxxxxxxxx> wrote:

>On 04/08/2015 09:37 PM, Yuming Ma (yumima) wrote:
>> Josh,
>>
>> I think we are using plain live migration and not mirroring block drives
>> as the other test did.
>
>Do you have the migration flags or more from the libvirt log? Also,
>which version of qemu is this?
>
>The libvirt log message about qemuMigrationCancelDriveMirror from your
>first email is suspicious. Being unable to stop it may mean it was not
>running (fine, but libvirt shouldn't have tried to stop it), or that it
>kept running (bad, especially if it's trying to copy to the same rbd).
>
>> What are the chances, or the scenario, in which the disk image
>> can be corrupted during live migration when both source and target
>> are connected to the same volume and RBD caching is turned on:
>
>Generally, rbd caching with live migration is safe. The way to get
>corruption is to have drive-mirror try to copy over the rbd on the
>destination while the source is still using the disk...
>
>Did you observe fs corruption after a live migration, or just other odd
>symptoms? Since a reboot fixed it, it sounds more like memory corruption
>to me, unless the filesystem was fsck'd during the reboot.
>
>Josh
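
Josh's questions above (the migration flags, the qemu/libvirt versions, and whether
drive-mirror was involved) can be checked on the source hypervisor. A minimal sketch
of the commands, assuming a Kilo-era nova libvirt driver on RPM-based hosts with
default log and admin-socket paths, all of which may differ in a given deployment:

# libvirt live-migration flags nova is configured with (option name from Kilo-era nova)
grep -i live_migration_flag /etc/nova/nova.conf

# qemu and libvirt versions on source and destination
rpm -q qemu-kvm libvirt
virsh version

# confirm whether rbd caching is actually off for the qemu client
# (requires an admin socket configured for the client in ceph.conf; socket path is an example)
ls /var/run/ceph/
ceph --admin-daemon /var/run/ceph/<client-socket>.asok config show | grep rbd_cache

# look for drive-mirror activity around the migration in the libvirt log
grep -i -E 'drive-mirror|qemuMigrationCancelDriveMirror' /var/log/libvirt/libvirtd.log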