Hello,
I have 2 x RHEL 6.0 hosts (rhev1 and rhev2) where I enabled the HA and Resilient Storage beta channels.
I'm testing the checkpoint backend from the Beta HA Add-On channel. This is necessary because I want to test managing clusters of RHEL 5 guests (where, for example, I would keep one guest restricted to the first host and the other one to the other host).

The fence_virtd package is in official 6.0; fence-virtd-checkpoint, which I want to use, is instead only in the beta channels.
I see from the release notes that this feature, which is a tech preview right now, could probably become officially supported in the upcoming RHEL 6.1.

The 2 guests (vorastud1 and vorastud2) compose a RHEL 5.6 based cluster.
I configured fence-virtd on the hosts with its default config, using the multicast listener and the checkpoint backend
(btw: I had to manually run "chkconfig --add fence_virtd" ... I don't know if this is intended for testing purposes).
Inside the guests I configured the fence_xvm provided by RHEL 5.6.

Some notes of mine, to share doubts and solutions, if possible:

- the hosts use a bridged device brvlan66 configured with vlan over bonding:
  bond0 --> bond0.66 --> brvlan66
  (there is a bug open on this config, so at the moment I ifdown one of the interfaces composing the bond device)

- the hosts also use a bridged device brvlan65 configured with vlan over bonding on the same bond0:
  bond0 --> bond0.65 --> brvlan65

- the guests are configured with production over brvlan66 and cluster over brvlan65

- To let the multicast traffic used by the guests' fencing go through, I have to put this rule in /etc/sysconfig/iptables on the hosts:
  -I INPUT -d 225.0.0.12 -j ACCEPT

  I initially tried to log the traffic and got this:
  Mar 24 11:04:57 rhev1 kernel: IN=brvlan66 OUT= MAC=01:00:5e:00:00:0c:00:1e:79:2c:e2:88:08:00 SRC=10.192.15.65 DST=225.0.0.12 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=35426 PROTO=2

  So I tried
  -I INPUT -m physdev --physdev-in brvlan66 -d 225.0.0.12 -j ACCEPT
  but this doesn't work (it doesn't log either....)
  So one question is: what would be a more restrictive rule to use?

- considering my guests' cluster nodes
  vorastud1==vnode02
  vorastud2==vnode01
  (the naming is odd because this cluster is a duplicate of an existing one configured that way ;-)

  vorastud1 runs on rhev1
  [root@rhev1 ~]# virsh list
   Id Name                 State
  ----------------------------------
    3 vorastud1            running

  vorastud2 runs on rhev2
  [root@rhev2 ~]# virsh list | grep orast
   10 vorastud2            running

  vorastud1 10.4.5.164 (10.4.4.51 on intracluster)
  vorastud2 10.4.5.165 (10.4.4.52 on intracluster)
  guests' cluster vip 10.4.5.166
  No iptables running on guests

  The fencing device section of my guests' cluster is based on what I read at
  http://sources.redhat.com/cluster/wiki/XVM_FencingConfig
  and
  http://sources.redhat.com/cluster/wiki/VMClusterCookbook

  <clusternode name="vnode01" nodeid="1" votes="1">
          <fence>
                  <method name="2">
                          <device domain="vorastud2" name="xvm-rhev2"/>
                  </method>
                  <method name="1">
                          <device domain="vorastud2" name="xvm-rhev1"/>
                  </method>
          </fence>
  </clusternode>

  <fencedevices>
          <fencedevice name="xvm-rhev1" agent="fence_xvm" key_file="/etc/cluster/rhev1.key"/>
          <fencedevice name="xvm-rhev2" agent="fence_xvm" key_file="/etc/cluster/rhev2.key"/>
  </fencedevices>

  The keys on the hosts are different.
  I had to configure this double method because sometimes it seems I can reach a domain only through one host (which is not necessarily the one where the guest is running).
  Could this redundancy of configuration be considered useful in general or not?
  Does it make sense to use a host for guests running on other hosts?

- sometimes fence_virtd seems to hang

  [root@rhev1 ~]# service fence_virtd status
  fence_virtd (pid 2849) is running...
  [root@rhev1 ~]# strace -p 2849
  Process 2849 attached - interrupt to quit
  futex(0x7f751373f604, FUTEX_WAIT_PRIVATE, 1, NULL

  but

  [root@vorastud1 ~]# fence_xvm -H vorastud2 -k /etc/cluster/rhev1.key -ddd -o null
  Debugging threshold is now 3
  -- args @ 0x7fff5acdda40 --
    args->addr = 225.0.0.12
    args->domain = vorastud2
    args->key_file = /etc/cluster/rhev1.key
    args->op = 0
    args->hash = 2
    args->auth = 2
    args->port = 1229
    args->ifindex = 0
    args->family = 2
    args->timeout = 30
    args->retr_time = 20
    args->flags = 0
    args->debug = 3
  -- end args --
  Reading in key file /etc/cluster/rhev1.key into 0x7fff5acdc9f0 (4096 max size)
  Actual key length = 4096 bytesSending to 225.0.0.12 via 127.0.0.1
  Sending to 225.0.0.12 via 10.4.5.164
  Sending to 225.0.0.12 via 10.4.5.166
  Sending to 225.0.0.12 via 10.4.4.51
  Waiting for connection from XVM host daemon.
  [and so on retrying]

  no sort of iptables logging in messages.....

  if I now run

  [root@rhev1 ~]# service fence_virtd force-reload
  Stopping fence_virtd: [ OK ]
  Starting fence_virtd: [ OK ]

  [root@rhev1 ~]# service fence_virtd status
  fence_virtd (pid 2166) is running...

  and the fence command now is successful:

  [root@vorastud1 ~]# fence_xvm -H vorastud2 -k /etc/cluster/rhev1.key -ddd -o null
  Debugging threshold is now 3
  -- args @ 0x7fff65d3e360 --
    args->addr = 225.0.0.12
    args->domain = vorastud2
    args->key_file = /etc/cluster/rhev1.key
    args->op = 0
    args->hash = 2
    args->auth = 2
    args->port = 1229
    args->ifindex = 0
    args->family = 2
    args->timeout = 30
    args->retr_time = 20
    args->flags = 0
    args->debug = 3
  -- end args --
  Reading in key file /etc/cluster/rhev1.key into 0x7fff65d3d310 (4096 max size)
  Actual key length = 4096 bytesSending to 225.0.0.12 via 127.0.0.1
  Sending to 225.0.0.12 via 10.4.5.164
  Sending to 225.0.0.12 via 10.4.5.166
  Sending to 225.0.0.12 via 10.4.4.51
  Waiting for connection from XVM host daemon.
  Issuing TCP challenge
  Responding to TCP challenge
  TCP Exchange + Authentication done...
  Waiting for return value from XVM host
  Remote: Operation failed

  And in messages of rhev1:

  Mar 29 16:13:48 rhev1 kernel: IN=brvlan66 OUT= MAC=01:00:5e:00:00:0c:52:54:00:08:50:b3:08:00 SRC=10.4.5.164 DST=225.0.0.12 LEN=204 TOS=0x00 PREC=0x00 TTL=2 ID=0 DF PROTO=UDP SPT=29349 DPT=1229 LEN=184
  Mar 29 16:13:48 rhev1 kernel: IN=brvlan66 OUT= MAC=01:00:5e:00:00:0c:52:54:00:08:50:b3:08:00 SRC=10.4.5.166 DST=225.0.0.12 LEN=204 TOS=0x00 PREC=0x00 TTL=2 ID=0 DF PROTO=UDP SPT=42058 DPT=1229 LEN=184
  Mar 29 16:15:23 rhev1 kernel: IN=brvlan66 OUT= MAC=01:00:5e:00:00:0c:00:0b:bf:89:06:40:08:00 SRC=10.4.5.161 DST=225.0.0.12 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 PROTO=2
  Mar 29 16:15:24 rhev1 kernel: IN=brvlan66 OUT= MAC=01:00:5e:00:00:0c:00:1e:79:2c:e2:80:08:00 SRC=10.4.5.161 DST=225.0.0.12 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=17217 PROTO=2
  Mar 29 16:15:24 rhev1 kernel: IN=brvlan66 OUT= MAC=01:00:5e:00:00:0c:00:0b:bf:89:06:40:08:00 SRC=10.4.5.161 DST=225.0.0.12 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 PROTO=2

  [root@rhev1 ~]# service fence_virtd status
  fence_virtd (pid 2166) is running...

  Now the strace command gives a different output when nothing is using it....
  [root@rhev1 ~]# strace -p 2166
  Process 2166 attached - interrupt to quit
  select(6, [5], NULL, NULL, NULL

- A few minutes after a restart of both the fence_virtd daemons:

  [root@vorastud1 ~]# fence_xvm -H vorastud1 -k /etc/cluster/rhev1.key -ddd -o null
  Debugging threshold is now 3
  -- args @ 0x7fffca3528b0 --
    args->addr = 225.0.0.12
    args->domain = vorastud1
    args->key_file = /etc/cluster/rhev1.key
    args->op = 0
    args->hash = 2
    args->auth = 2
    args->port = 1229
    args->ifindex = 0
    args->family = 2
    args->timeout = 30
    args->retr_time = 20
    args->flags = 0
    args->debug = 3
  -- end args --
  Reading in key file /etc/cluster/rhev1.key into 0x7fffca351860 (4096 max size)
  Actual key length = 4096 bytesSending to 225.0.0.12 via 127.0.0.1
  Sending to 225.0.0.12 via 10.4.5.164
  Sending to 225.0.0.12 via 10.4.5.166
  Sending to 225.0.0.12 via 10.4.4.51
  Waiting for connection from XVM host daemon.
  [... keeps retrying...]

  The strace output on rhev1 shows no progress:

  [root@rhev1 ~]# strace -p 5168
  Process 5168 attached - interrupt to quit
  select(6, [5], NULL, NULL, NULL

  while the strace on rhev2 (which has a different key from the one specified on the command line) shows:

  [root@rhev2 ~]# strace -p 17145
  Process 17145 attached - interrupt to quit
  ...
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(39460), sin_addr=inet_addr("10.4.5.164")}, [16]) = 176
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(65166), sin_addr=inet_addr("10.4.5.166")}, [16]) = 176
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(33259), sin_addr=inet_addr("10.4.5.164")}, [16]) = 176
  write(1, "00000000000000000000000000000000"..., 4096) = 4096
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(28562), sin_addr=inet_addr("10.4.5.166")}, [16]) = 176
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(18599), sin_addr=inet_addr("10.4.5.164")}, [16]) = 176
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(42607), sin_addr=inet_addr("10.4.5.166")}, [16]) = 176
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(55231), sin_addr=inet_addr("10.4.5.164")}, [16]) = 176
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(19253), sin_addr=inet_addr("10.4.5.166")}, [16]) = 176
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(41675), sin_addr=inet_addr("10.4.5.164")}, [16]) = 176
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(44621), sin_addr=inet_addr("10.4.5.166")}, [16]) = 176
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(41881), sin_addr=inet_addr("10.4.5.164")}, [16]) = 176
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(65021), sin_addr=inet_addr("10.4.5.166")}, [16]) = 176

  If I force-reload fence_virtd on rhev1, it begins answering again....

  [root@rhev1 ~]# service fence_virtd force-reload
  Stopping fence_virtd: [ OK ]
  Starting fence_virtd: [ OK ]

  [root@rhev1 ~]# service fence_virtd status
  fence_virtd (pid 5979) is running...

  [root@rhev1 ~]# strace -p 5979
  Process 5979 attached - interrupt to quit
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(64221), sin_addr=inet_addr("10.4.5.164")}, [16]) = 176
  socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
  setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
  fcntl(6, F_GETFL) = 0x2 (flags O_RDWR)
  fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0
  connect(6, {sa_family=AF_INET, sin_port=htons(1229), sin_addr=inet_addr("10.4.5.164")}, 16) = -1 EINPROGRESS (Operation now in progress)
  select(7, [6], [6], NULL, {5, 0}) = 1 (out [6], left {4, 997894})
  getsockopt(6, SOL_SOCKET, SO_ERROR, [8589934592], [4]) = 0
  fcntl(6, F_SETFL, O_RDWR) = 0
  select(7, [6], NULL, NULL, {10, 0}) = 1 (in [6], left {9, 919600})
  read(6, "A+\313\316\223\363\201\305\305s~\230\"\363\363g\332\225\363\32\335\266\17\333\252\363=\304N\1\322d"..., 64) = 64
  write(6, "\21U\275\270;\345g\253\214\340\35\250\207&\323j[\303\307\324Y\257\301\353\353\1\312y\253\0\202\302"..., 64) = 64
  open("/dev/urandom", O_RDONLY) = 7
  read(7, "\34\203\16\25\337f\323z='\372\177\2115\2133\353\255z\222\245hJ\341sA\331\256\245\314x,"..., 64) = 64
  close(7) = 0
  write(6, "\34\203\16\25\337f\323z='\372\177\2115\2133\353\255z\222\245hJ\341sA\331\256\245\314x,"..., 64) = 64
  select(7, [6], NULL, NULL, {10, 0}) = 1 (in [6], left {9, 999856})
  read(6, "\20\301\366B\5\355\262@\345\227\260\316\30\367\341v\5\343S \30s\317H*|\277\221\16r\263\343"..., 64) = 64
  write(6, "\1", 1) = 1
  close(6) = 0
  select(6, [5], NULL, NULL, NULL) = 1 (in [5])
  recvfrom(5, "\0\2\4\0vorastud1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 176, 0, {sa_family=AF_INET, sin_port=htons(14528), sin_addr=inet_addr("10.4.5.166")}, [16]) = 176
  socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
  setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
  fcntl(6, F_GETFL) = 0x2 (flags O_RDWR)
  fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0
  connect(6, {sa_family=AF_INET, sin_port=htons(1229), sin_addr=inet_addr("10.4.5.166")}, 16) = -1 EINPROGRESS (Operation now in progress)
  select(7, [6], [6], NULL, {5, 0}) = 1 (out [6], left {4, 999124})
  getsockopt(6, SOL_SOCKET, SO_ERROR, [8589934592], [4]) = 0
  fcntl(6, F_SETFL, O_RDWR) = 0
  select(7, [6], NULL, NULL, {10, 0}) = 1 (in [6], left {9, 998486})
  read(6, 0x7fff92689470, 64) = -1 ECONNRESET (Connection reset by peer)
  dup(2) = 7
  fcntl(7, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
  fstat(7, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
  ioctl(7, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff926890f0) = -1 ENOTTY (Inappropriate ioctl for device)
  mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd5e1f41000
  lseek(7, 0, SEEK_CUR) = 0
  write(7, "read: Connection reset by peer\n", 31) = 31
  close(7) = 0
  munmap(0x7fd5e1f41000, 4096) = 0
  close(6) = 0
  select(6, [5], NULL, NULL, NULL^C <unfinished ...>
  Process 5979 detached

Gianluca
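
On the iptables question above, a possible narrower rule set is sketched below. It rests on assumptions that have not been verified on these hosts: that the requests can be matched on the bridge device itself with -i (the iptables physdev match documents --physdev-in as naming a bridge port, such as bond0.66, rather than the bridge brvlan66, which may be why that attempt never matched), that fence_xvm sends UDP to 225.0.0.12 port 1229 as seen in the kernel log, and that the IGMP packets (PROTO=2) for the group should be accepted as well. The helper name fence_rules is made up for illustration; it only prints lines in /etc/sysconfig/iptables format and does not touch the running firewall.

```shell
#!/bin/sh
# Print candidate /etc/sysconfig/iptables lines that accept only the
# fence_xvm multicast requests instead of everything sent to the group.
# Hypothetical helper: the name and argument order are illustrative only.
fence_rules() {
    iface=$1   # bridge the guests' requests arrive on, e.g. brvlan66
    group=$2   # multicast address used by fence_xvm, e.g. 225.0.0.12
    port=$3    # fence_xvm UDP port, e.g. 1229

    # UDP requests sent by fence_xvm to the multicast group
    echo "-I INPUT -i $iface -p udp -d $group --dport $port -j ACCEPT"
    # IGMP traffic for the same group (assumption: it must be accepted too)
    echo "-I INPUT -i $iface -p igmp -d $group -j ACCEPT"
}

# Values taken from the logs in this post
fence_rules brvlan66 225.0.0.12 1229
```

The two printed lines could stand in for the broader "-I INPUT -d 225.0.0.12 -j ACCEPT" rule. Whether the IGMP line is needed at all depends on the switch and querier setup, so it should be treated as optional.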