VM locking in EL6 clusters

"Devin A. Bougie" <devin.bougie@xxxxxxxxxxx> · Thu, 15 May 2014 15:16:52 +0000

Hello,

We are running three EL6 clusters using the Red Hat High Availability Add-On, and are encountering issues with locking our clustered VM’s.

We followed the virtual machine disk locking documentation (http://libvirt.org/locking.html#sanlockstorage), specifically its recommendation to have sanlock's disk_lease_dir set to a shared GFS2 file system.  Ideally we would store the leases on a clustered logical volume block device, but haven’t yet figured out to configure this.  Our VM XML definitions are on the same GFS2 file system, and our KVM virtual machines are using clustered logical volumes for their block devices.

For example, here are a few configuration excerpts.
———
- From /etc/cluster/cluster.conf:
<clusterfs device="/dev/vgift1/cluster" force_unmount="0" fstype="gfs2" mountpoint="/gfs/cluster" name="cluster" options="noatime"/>
...
<vm autostart="0" domain="fd14" migrate="live" name="lnx91" path="/gfs/cluster/libvirt/qemu" recovery="relocate”/>

- From /etc/fstab:
/dev/vgift1/cluster	/gfs/cluster	gfs2	noatime	0	0

- From /gfs/cluster/libvirt/qemu/lnx91.xml:
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/vgift1/lnx91'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>

- From /etc/libvirt/quemu-sanlock.conf:
host_id = 3
auto_disk_leases = 1
disk_lease_dir = "/gfs/cluster/libvirt/sanlock"
———

In general this works well, but periodically we see warnings and eventual errors that break sanlock and subsequently each clustered VM.  After the problem occurs, recovery involves `service libvirtd restart ; clusvcadm -d vm:lnx91 ; clusvcadm -e vm:lnx91`.  This is consistent across all three of our clusters (two at 6.4, one at fully updated 6.5).  We see no other indications of problems with our iSCSI setup or GFS2 file systems.  Please see below for an example of what we see in /var/log/messages.

I would greatly appreciate any suggestions for resolving or debugging this problem on our EL6 clusters.  I would be happy to provide entire configuration or log files, or open a bugzilla report if necessary.

Many thanks,
Devin

———
…
May 14 23:32:43 lnx903 sanlock[8476]: 2014-05-14 23:32:43-0400 5909646 [8476]: s2 check_our_lease warning 60 last_success 5909586
May 14 23:32:44 lnx903 sanlock[8476]: 2014-05-14 23:32:44-0400 5909647 [8476]: s2 check_our_lease warning 61 last_success 5909586
May 14 23:32:45 lnx903 sanlock[8476]: 2014-05-14 23:32:45-0400 5909648 [8476]: s2 check_our_lease warning 62 last_success 5909586
May 14 23:32:46 lnx903 sanlock[8476]: 2014-05-14 23:32:46-0400 5909649 [8476]: s2 check_our_lease warning 63 last_success 5909586
May 14 23:32:47 lnx903 sanlock[8476]: 2014-05-14 23:32:47-0400 5909650 [8476]: s2 check_our_lease warning 64 last_success 5909586
May 14 23:32:48 lnx903 sanlock[8476]: 2014-05-14 23:32:48-0400 5909651 [8476]: s2 check_our_lease warning 65 last_success 5909586
May 14 23:32:49 lnx903 sanlock[8476]: 2014-05-14 23:32:49-0400 5909652 [8476]: s2 check_our_lease warning 66 last_success 5909586
May 14 23:32:50 lnx903 sanlock[8476]: 2014-05-14 23:32:50-0400 5909653 [8476]: s2 check_our_lease warning 67 last_success 5909586
May 14 23:32:51 lnx903 sanlock[8476]: 2014-05-14 23:32:51-0400 5909654 [8476]: s2 check_our_lease warning 68 last_success 5909586
May 14 23:32:52 lnx903 sanlock[8476]: 2014-05-14 23:32:52-0400 5909655 [8476]: s2 check_our_lease warning 69 last_success 5909586
May 14 23:32:53 lnx903 sanlock[8476]: 2014-05-14 23:32:53-0400 5909656 [8476]: s2 check_our_lease warning 70 last_success 5909586
May 14 23:32:54 lnx903 sanlock[8476]: 2014-05-14 23:32:54-0400 5909657 [8476]: s2 check_our_lease warning 71 last_success 5909586
May 14 23:32:55 lnx903 sanlock[8476]: 2014-05-14 23:32:55-0400 5909658 [8476]: s2 check_our_lease warning 72 last_success 5909586
May 14 23:32:56 lnx903 sanlock[8476]: 2014-05-14 23:32:56-0400 5909659 [8476]: s2 check_our_lease warning 73 last_success 5909586
May 14 23:32:57 lnx903 wdmd[3627]: test warning now 5909660 ping 5909650 close 5210877 renewal 5909586 expire 5909666 client 8476 sanlock___LIBVIRT__DISKS__:3
May 14 23:32:57 lnx903 wdmd[3627]: /dev/watchdog closed unclean
May 14 23:32:57 lnx903 kernel: iTCO_wdt: Unexpected close, not stopping watchdog!
May 14 23:32:57 lnx903 sanlock[8476]: 2014-05-14 23:32:57-0400 5909660 [8476]: s2 check_our_lease warning 74 last_success 5909586
May 14 23:32:58 lnx903 wdmd[3627]: test warning now 5909661 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock___LIBVIRT__DISKS__:3
May 14 23:32:58 lnx903 sanlock[8476]: 2014-05-14 23:32:58-0400 5909661 [8476]: s2 check_our_lease warning 75 last_success 5909586
May 14 23:32:59 lnx903 wdmd[3627]: test warning now 5909662 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock___LIBVIRT__DISKS__:3
May 14 23:32:59 lnx903 sanlock[8476]: 2014-05-14 23:32:59-0400 5909662 [8476]: s2 check_our_lease warning 76 last_success 5909586
May 14 23:33:00 lnx903 wdmd[3627]: test warning now 5909663 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock___LIBVIRT__DISKS__:3
May 14 23:33:00 lnx903 sanlock[8476]: 2014-05-14 23:33:00-0400 5909663 [8476]: s2 check_our_lease warning 77 last_success 5909586
May 14 23:33:01 lnx903 wdmd[3627]: test warning now 5909664 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock___LIBVIRT__DISKS__:3
May 14 23:33:01 lnx903 sanlock[8476]: 2014-05-14 23:33:01-0400 5909664 [8476]: s2 check_our_lease warning 78 last_success 5909586
May 14 23:33:02 lnx903 wdmd[3627]: test warning now 5909665 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock___LIBVIRT__DISKS__:3
May 14 23:33:02 lnx903 sanlock[8476]: 2014-05-14 23:33:02-0400 5909665 [8476]: s2 check_our_lease warning 79 last_success 5909586
May 14 23:33:03 lnx903 wdmd[3627]: test failed rem 54 now 5909666 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock___LIBVIRT__DISKS__:3
May 14 23:33:03 lnx903 sanlock[8476]: 2014-05-14 23:33:03-0400 5909666 [8476]: s2 check_our_lease failed 80
May 14 23:33:03 lnx903 sanlock[8476]: 2014-05-14 23:33:03-0400 5909666 [8476]: s2 kill 19407 sig 15 count 1
May 14 23:33:03 lnx903 sanlock[8476]: 2014-05-14 23:33:03-0400 5909666 [8476]: s2 kill 24035 sig 15 count 1
May 14 23:33:03 lnx903 sanlock[8476]: 2014-05-14 23:33:03-0400 5909667 [8476]: s2 kill 19407 sig 15 count 2
May 14 23:33:03 lnx903 sanlock[8476]: 2014-05-14 23:33:03-0400 5909667 [8476]: s2 kill 24035 sig 15 count 2
May 14 23:33:04 lnx903 kernel: br0: port 3(vnet1) entering disabled state
May 14 23:33:04 lnx903 kernel: device vnet1 left promiscuous mode
May 14 23:33:04 lnx903 kernel: br0: port 3(vnet1) entering disabled state
May 14 23:33:04 lnx903 wdmd[3627]: test failed rem 53 now 5909667 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock___LIBVIRT__DISKS__:3
May 14 23:33:04 lnx903 kernel: br0: port 2(vnet0) entering disabled state
May 14 23:33:04 lnx903 kernel: device vnet0 left promiscuous mode
May 14 23:33:04 lnx903 kernel: br0: port 2(vnet0) entering disabled state
May 14 23:33:04 lnx903 sanlock[8476]: 2014-05-14 23:33:04-0400 5909668 [8476]: s2 kill 19407 sig 15 count 3
May 14 23:33:04 lnx903 sanlock[8476]: 2014-05-14 23:33:04-0400 5909668 [8476]: s2 kill 24035 sig 15 count 3
May 14 23:33:05 lnx903 wdmd[3627]: test failed rem 52 now 5909668 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock___LIBVIRT__DISKS__:3
May 14 23:33:05 lnx903 ntpd[3431]: Deleting interface #47 vnet1, fe80::fc54:ff:febd:58b2#123, interface stats: received=0, sent=0, dropped=0, active_time=30984 secs
May 14 23:33:05 lnx903 ntpd[3431]: Deleting interface #46 vnet0, fe80::fc54:ff:fe14:d476#123, interface stats: received=0, sent=0, dropped=0, active_time=31076 secs
May 14 23:33:05 lnx903 ntpd[3431]: peers refreshed
May 14 23:33:05 lnx903 sanlock[8476]: 2014-05-14 23:33:05-0400 5909669 [8476]: s2 kill 19407 sig 15 count 4
May 14 23:33:05 lnx903 sanlock[8476]: 2014-05-14 23:33:05-0400 5909669 [8476]: s2 kill 24035 sig 15 count 4
May 14 23:33:06 lnx903 sanlock[8476]: 2014-05-14 23:33:06-0400 5909669 [8476]: dead 19407 ci 2 count 4
May 14 23:33:06 lnx903 wdmd[3627]: test failed rem 51 now 5909669 ping 5909650 close 5909660 renewal 5909586 expire 5909666 client 8476 sanlock___LIBVIRT__DISKS__:3
May 14 23:33:06 lnx903 sanlock[8476]: 2014-05-14 23:33:06-0400 5909669 [8476]: dead 24035 ci 3 count 4
May 14 23:33:06 lnx903 sanlock[8476]: 2014-05-14 23:33:06-0400 5909669 [8476]: s2 all pids clear
May 14 23:33:07 lnx903 wdmd[3627]: /dev/watchdog reopen
May 14 23:33:35 lnx903 rgmanager[3795]: status on vm "lnx99" returned 1 (generic error)
May 14 23:33:35 lnx903 rgmanager[3795]: status on vm "lnx91" returned 1 (generic error)
May 14 23:33:45 lnx903 rgmanager[3795]: Stopping service vm:lnx99
May 14 23:33:45 lnx903 rgmanager[3795]: Stopping service vm:lnx91
May 14 23:33:46 lnx903 rgmanager[3795]: Service vm:lnx99 is recovering
May 14 23:33:46 lnx903 rgmanager[3795]: Service vm:lnx91 is recovering
May 14 23:34:34 lnx903 sanlock[8476]: 2014-05-14 23:34:34-0400 5909758 [6907]: s2 renewed 5909607 delta_length 151 too long
May 14 23:36:01 lnx903 rgmanager[3795]: Service vm:lnx99 is stopped
May 14 23:36:06 lnx903 rgmanager[3795]: Service vm:lnx91 is stopped
...
———

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster