Re: Xen network config -> Fence problem - More info

Madison Kelly <linux@xxxxxxxxxxx> · Sat, 31 Oct 2009 01:01:23 -0400

  After sending this, I went back to debugging the problem. The 
machines had stopped fencing and the DRBD link was down.

  So first I stopped and then started 'xend', which brought the Xen-type
networking up. I left the machines alone for about ten minutes to see if
they would fence one another; they didn't.

  So then I set about fixing DRBD. I got the array re-sync'ing and I 
thought I might have gotten things working, but about 15 or 30 seconds 
after getting the DRBD back online, one node fenced the other again. It 
may have been a coincidence, but the last command I called before one 
node fenced the other was 'pvdisplay' to check the LVM PVs. That command
didn't return and may have been the trigger, but I am not sure.

  So it looks like they fence each other until DRBD breaks. Once the
array is fixed and/or pvdisplay is called, the fence loop starts again.

Madi

Madison Kelly wrote:
Hi all,

  I've got CentOS 5.3 installed on two nodes (a simple two-node cluster).
On this, I've got a DRBD partition running cluster-aware LVM. I use this
to host VMs under Xen.

  I've got a problem where I am trying to use eth0 as a back channel for 
the VMs on either node via a firewall VM. The network setup on each node 
is:

eth0: back channel and IPMI, connected only to an internal network.
eth1: dedicated DRBD link.
eth2: Internet-facing interface.

  I want to get eth0 and eth2 under Xen's networking, but the default
config was to leave eth0 alone. Specifically, the
convirt-xen-multibridge script is set to:

"$dir/network-bridge" "$@" vifnum=0 netdev=peth0 bridge=xenbr0

  When I change this to:

"$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=xenbr0

  one of the nodes will soon fence the other, and when that node comes
back up it fences the first. Eventually one node stays up and constantly
fences the other.
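
  For reference, the end state described above (both eth0 and eth2
bridged) could be sketched as a wrapper along these lines. This is an
illustration under assumptions, not the actual convirt-xen-multibridge
file: the eth2 line, the names vifnum=2 and xenbr2, the /etc/xen/scripts
default for $dir and the DRY_RUN switch are all hypothetical. It prints
the commands by default instead of running them, since changing the
bridge setup is what appears to set off the fencing.

```shell
#!/bin/sh
# Hypothetical multi-bridge wrapper sketch; not the real ConVirt script.

# Assumed location of Xen's network-bridge script if $dir is not set.
dir=${dir:-/etc/xen/scripts}

# Print the command when DRY_RUN is 1 (the default here), run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Pass xend's arguments (for example "start" or "stop") to each bridge call.
setup_bridges() {
    # eth0: back channel. This is the changed line from the post.
    run "$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=xenbr0
    # eth2: Internet-facing. vifnum=2 and xenbr2 are assumed names.
    run "$dir/network-bridge" "$@" vifnum=2 netdev=eth2 bridge=xenbr2
}
```

  Calling 'setup_bridges start' prints the two network-bridge commands;
setting DRY_RUN=0 would execute them instead.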

  The node that gets fenced (vsh02) prints this repeatedly to the log
just before it is fenced:

Oct 31 00:27:21 vsh02 openais[3133]: [TOTEM] FAILED TO RECEIVE
Oct 31 00:27:21 vsh02 openais[3133]: [TOTEM] entering GATHER state from 6.

  And the node that stays up prints this:

Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] The token was lost in the OPERATIONAL state.
Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] entering GATHER state from 2.
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering GATHER state from 0.
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Creating commit token because I am the rep.
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Saving state aru 2c high seq received 2c
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Storing new sequence id for ring 108
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering COMMIT state.
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering RECOVERY state.
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] position [0] member 10.255.135.3:
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] previous ring seq 260 rep 10.255.135.2
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] aru 2c high delivered 2c received flag 1
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Did not need to originate any messages in recovery.
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Sending initial ORF token
Oct 31 00:35:51 vsh03 openais[3237]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 31 00:35:51 vsh03 openais[3237]: [CLM  ] New Configuration:
Oct 31 00:35:51 vsh03 kernel: dlm: closing connection to node 1
Oct 31 00:35:51 vsh03 fenced[3256]: vsh02.domain.com not a cluster member after 0 sec post_fail_delay
Oct 31 00:35:51 vsh03 openais[3237]: [CLM  ]     r(0) ip(10.255.135.3)
Oct 31 00:35:51 vsh03 fenced[3256]: fencing node "vsh02.domain.com"
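
  To line up the token loss with the fence actions on both nodes, the
relevant lines can be pulled out of syslog with something like the
sketch below. The function names are made up for illustration, the
patterns are taken only from the messages quoted above, and
/var/log/messages is assumed to be where these entries land.

```shell
#!/bin/sh
# Hypothetical helpers for filtering the openais/fenced lines shown above.

# Print only token-loss, failed-receive and fence-action lines.
# Reads the named files, or stdin if no file is given.
cluster_events() {
    grep -E -e '\[TOTEM\] FAILED TO RECEIVE' \
            -e '\[TOTEM\] The token was lost' \
            -e 'fenced\[[0-9]+\]: fencing node' "$@"
}

# Count fence actions per target node; prints "<node> <count>" per node.
fence_counts() {
    awk '/fenced\[[0-9]+\]: fencing node/ {
             gsub(/"/, "", $NF)   # strip the quotes around the node name
             n[$NF]++
         }
         END { for (k in n) print k, n[k] }' "$@"
}
```

  For example, 'cluster_events /var/log/messages' run on each node gives
a short timeline to compare, and 'fence_counts /var/log/messages' shows
how often each node has been fenced.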

  If I leave it long enough, the failed node (vsh02 in this case) stops
getting fenced, but the Xen networking doesn't come up. Specifically, no
vifX.Y, xenbrX or other devices get created.

  Any idea what might be going on? I really need to get eth0 virtualized 
so that I can get routing to work.

Thanks!

Madi

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster