Hi,

I have a problem with a 2-node cluster running Xen virtual machines. The configuration is very simple: node 1 (d1) runs vm_service1 and node 2 (d2) runs vm_service2, with an APC Master Switch configured as the fence device. Everything works well: starting, stopping and migrating virtual services between nodes.

The problem occurs when I test a crash of one of the nodes by, for example, shutting down node d2. In this case node d1 discovers that node d2 has failed and fences it through the APC device. After node d2 comes back up, it joins the cluster and tries to relocate vm_service2. But while it does so I get strange logs on node d2:

Jan 31 21:18:11 d2 openais[5485]: [TOTEM] entering OPERATIONAL state.
Jan 31 21:18:11 d2 openais[5485]: [CLM ] got nodejoin message 10.0.200.101
Jan 31 21:18:11 d2 openais[5485]: [CLM ] got nodejoin message 10.0.200.102
Jan 31 21:18:11 d2 openais[5485]: [CPG ] got joinlist message from node 2
Jan 31 21:18:45 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:18:46 d2 openais[5485]: [TOTEM] Retransmit List: 31
....
Jan 31 21:19:10 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:11 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:11 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:11 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:12 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:15 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:15 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:16 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:16 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:16 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:16 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:17 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:17 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:18 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:18 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:18 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:19 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:19 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:19 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:20 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:20 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:20 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:20 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:23 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:23 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:23 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:24 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:24 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:24 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:26 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:26 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:26 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:27 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:27 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:27 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:29 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:29 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:29 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:30 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:30 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:30 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:32 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:32 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:32 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:33 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:33 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:33 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:35 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:35 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:35 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:36 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:36 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:36 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:38 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:38 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:38 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:43 d2 openais[5485]: [TOTEM] entering GATHER state from 0.
Jan 31 21:20:18 d2 openais[5485]: [TOTEM] The consensus timeout expired.
Jan 31 21:20:18 d2 openais[5485]: [TOTEM] entering GATHER state from 3.
Jan 31 21:20:52 d2 openais[5485]: [TOTEM] The consensus timeout expired.
Jan 31 21:20:52 d2 openais[5485]: [TOTEM] entering GATHER state from 3.
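
The d2 log above is very repetitive, so a summary may be easier to read than the raw lines. Below is a minimal Python sketch of one way to count each distinct openais message and note when it first appeared. The function name summarize, the regular expression and the sample list are my own illustration (assumptions, not part of openais or the cluster tools); the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches syslog lines written by openais, for example:
#   Jan 31 21:18:45 d2 openais[5485]: [TOTEM] Retransmit List: 31
# The pattern is an assumption based only on the lines shown in this post.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"openais\[\d+\]:\s+\[(?P<subsys>[A-Z ]+)\]\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Return (counts, first_seen), both keyed by (subsystem, message)."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip kernel, fenced and other non-openais lines
        key = (m.group("subsys").strip(), m.group("msg").strip())
        counts[key] += 1
        first_seen.setdefault(key, m.group("ts"))
    return counts, first_seen


# A few lines taken from the d2 log in this post.
sample = [
    "Jan 31 21:18:11 d2 openais[5485]: [TOTEM] entering OPERATIONAL state.",
    "Jan 31 21:18:45 d2 openais[5485]: [TOTEM] Retransmit List: 31",
    "Jan 31 21:18:46 d2 openais[5485]: [TOTEM] Retransmit List: 31",
    "Jan 31 21:19:10 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE",
    "Jan 31 21:19:11 d2 openais[5485]: [TOTEM] entering GATHER state from 6.",
    "Jan 31 21:19:11 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE",
]

counts, first_seen = summarize(sample)
for (subsys, msg), n in counts.most_common():
    print(f"{n:3d}x  first at {first_seen[(subsys, msg)]}  [{subsys}] {msg}")
```

Run against the full /var/log/messages from both nodes, something like this should make it easier to see how long after the join the first retransmit request appears.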

And on node d1:

Jan 31 21:18:08 d1 openais[5467]: [CLM ] CLM CONFIGURATION CHANGE
Jan 31 21:18:08 d1 openais[5467]: [CLM ] New Configuration:
Jan 31 21:18:08 d1 openais[5467]: [CLM ] r(0) ip(10.0.200.101)
Jan 31 21:18:08 d1 openais[5467]: [CLM ] Members Left:
Jan 31 21:18:08 d1 openais[5467]: [CLM ] Members Joined:
Jan 31 21:18:08 d1 openais[5467]: [CLM ] CLM CONFIGURATION CHANGE
Jan 31 21:18:08 d1 openais[5467]: [CLM ] New Configuration:
Jan 31 21:18:09 d1 openais[5467]: [CLM ] r(0) ip(10.0.200.101)
Jan 31 21:18:09 d1 openais[5467]: [CLM ] r(0) ip(10.0.200.102)
Jan 31 21:18:09 d1 openais[5467]: [CLM ] Members Left:
Jan 31 21:18:09 d1 openais[5467]: [CLM ] Members Joined:
Jan 31 21:18:09 d1 openais[5467]: [CLM ] r(0) ip(10.0.200.102)
Jan 31 21:18:09 d1 openais[5467]: [SYNC ] This node is within the primary component and will provide service.
Jan 31 21:18:09 d1 openais[5467]: [TOTEM] entering OPERATIONAL state.
Jan 31 21:18:09 d1 openais[5467]: [CLM ] got nodejoin message 10.0.200.101
Jan 31 21:18:10 d1 openais[5467]: [CLM ] got nodejoin message 10.0.200.102
Jan 31 21:18:10 d1 openais[5467]: [CPG ] got joinlist message from node 2
Jan 31 21:18:15 d1 kernel: dlm: connecting to 1
Jan 31 21:18:15 d1 kernel: dlm: got connection from 1
Jan 31 21:19:47 d1 openais[5467]: [TOTEM] The token was lost in the OPERATIONAL state.
Jan 31 21:19:47 d1 openais[5467]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jan 31 21:19:47 d1 openais[5467]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Jan 31 21:19:47 d1 openais[5467]: [TOTEM] entering GATHER state from 2.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] entering GATHER state from 0.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Creating commit token because I am the rep.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Saving state aru 30 high seq received 31
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Storing new sequence id for ring 4bc
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] entering COMMIT state.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] entering RECOVERY state.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] position [0] member 10.0.200.101:
Jan 31 21:19:52 d1 kernel: dlm: closing connection to node 1
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] previous ring seq 1208 rep 10.0.200.101
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] aru 30 high delivered 30 received flag 0
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] copying all old ring messages from 31-31.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Originated 0 messages in RECOVERY.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Originated for recovery:
Jan 31 21:19:52 d1 fenced[5484]: d2.local.polska.pl not a cluster member after 0 sec post_fail_delay
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Not Originated for recovery: 31
Jan 31 21:19:52 d1 fenced[5484]: fencing node "d2"
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Sending initial ORF token
Jan 31 21:19:53 d1 fenced[5484]: fence "d2" success

As a consequence, I cannot start the cluster, because node d1 constantly fences node d2.

After some research I found that the problem might be in Xen networking. While a virtual service is starting, the Xen bridges are reconfigured (am I wrong?), and therefore communication between the nodes is interrupted. But I don't know what to change in the Xen configuration so that the cluster starts working.

Cheers
Agnieszka Kukalowicz