Hi all,
we are building a cluster to virtualize a lot of low-end servers using Xen. Our plan is to use RHCS and CLVM for this, but iLO fencing insists on not working... :-|
The cluster has two nodes, both HP ProLiant DL580 G5 (x86_64). We use multi-VLAN access to reach a lot of networks, and an EMC Symmetrix plus multipath to share the disks. Everything is fine except when I need iLO to give us a reliable fencing method for HA. Here is my cluster.conf:
<?xml version="1.0"?>
<cluster name="alpha" config_version="3">
  <cman two_node="0" expected_votes="3"/>
  <clusternodes>
    <clusternode name="node1.ha" votes="1" nodeid="1"/>
    <fence>
      <method name="1">
        <device name="ilo-node1"/>
      </method>
      <method name="2">
        <device name="manual" nodename="node1.ha"/>
      </method>
    </fence>
    <clusternode name="node2.ha" votes="1" nodeid="2"/>
    <fence>
      <method name="1">
        <device name="ilo-node2"/>
      </method>
      <method name="2">
        <device name="manual" nodename="node2.ha"/>
      </method>
    </fence>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ilo" hostname="10.127.255.129" login="Administrator" name="ilo-node1" passwd="xxxx"/>
    <fencedevice agent="fence_ilo" hostname="10.127.255.130" login="Administrator" name="ilo-node2" passwd="xxxx"/>
    <fencedevice agent="fence_manual" name="manual"/>
  </fencedevices>
  <quorumd device="/dev/mapper/3600604800002877515624d4630383434p1" tko="10" votes="1" log_facility="local6" log_level="7" min_score="1" interval="1">
    <heuristic interval="4" tko="3" program="ping -c1 -t3 10.10.10.1" score="1"/>
    <heuristic interval="4" tko="3" program="ping -c1 -t3 10.10.10.2" score="1"/>
  </quorumd>
  <rm log_facility="local5" log_level="7">
    <failoverdomains>
      <failoverdomain name="para_dom" nofailback="1" ordered="1" restricted="0">
        <failoverdomainnode name="node1.ha" priority="1"/>
        <failoverdomainnode name="node2.ha" priority="2"/>
      </failoverdomain>
      <failoverdomain name="hvm_dom" nofailback="1" ordered="1" restricted="0">
        <failoverdomainnode name="node1.ha" priority="2"/>
        <failoverdomainnode name="node2.ha" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources/>
    <vm autostart="1" domain="para_dom" exclusive="0" migrate="live" name="rh52-para-virt01" path="/etc/xen"/>
    <vm autostart="1" domain="hvm_dom" exclusive="0" migrate="live" name="w2003-vm01" path="/etc/xen"/>
  </rm>
</cluster>
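By the way, whenever I change cluster.conf I bump config_version and push it with what I understand is the usual RHEL 5 sequence (just a sketch of my procedure):

node1# ccs_tool update /etc/cluster/cluster.conf
node1# cman_tool version -r 3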
node1# clustat
Cluster Status for alpha @ Sun Oct 26 21:32:52 2008
Member Status: Quorate

 Member Name                                       ID   Status
 ------ ----                                       ---- ------
 node1.ha                                             1 Online, Local, rgmanager
 node2.ha                                             2 Online, rgmanager
 /dev/mapper/3600604800002877515624d4630383434p1      0 Online, Quorum Disk

 Service Name             Owner (Last)      State
 ------- ----             ----- ------      -----
 vm:rh52-para-virt01      node1.ha          started
 vm:w2003-vm01            node2.ha          started
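For completeness, membership can also be checked with cman_tool; I can post that output too if it helps:

node1# cman_tool status
node1# cman_tool nodes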
Now look: when I try to fence the other node, it doesn't work.
node1# fence_node node2.ha
node1# echo $?
1
node1# tail -1 /var/log/messages
Oct 26 21:44:44 xxxxx fence_node[1480]: Fence of "node2.ha" was unsuccessful
But if I call the fence agent directly, it works fine.
node1# ./fence_ilo -o off -l Administrator -p xxxx -a 10.127.255.130
success
node1# echo $?
0
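Since the direct call works, my next test will be to feed the agent the same attributes on stdin, which as far as I understand is how fenced invokes it (rough sketch; the key names are copied from my cluster.conf, and I'm not sure whether the last one is spelled action or option in this agent version):

node1# fence_ilo <<EOF
agent=fence_ilo
hostname=10.127.255.130
login=Administrator
passwd=xxxx
action=off
EOF

Meanwhile, after powering node2 off directly as above, clustat shows: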
node1# clustat
Cluster Status for alpha @ Sun Oct 26 21:56:36 2008
Member Status: Quorate

 Member Name                                       ID   Status
 ------ ----                                       ---- ------
 node1.ha                                             1 Online, Local, rgmanager
 node2.ha                                             2 Offline
 /dev/mapper/3600604800002877515624d4630383434p1      0 Online, Quorum Disk

 Service Name             Owner (Last)      State
 ------- ----             ----- ------      -----
 vm:rh52-para-virt01      node1.ha          started
 vm:w2003-vm01            node2.ha          started
Now node2 is offline, but the service stays where it was; node1 does not take over vm:w2003-vm01 from node2. Here is /var/log/messages:
node1# tail -50 /var/log/messages
Oct 26 21:44:44 xxxxx fence_node[1480]: Fence of "node2.ha" was unsuccessful
Oct 26 21:54:49 xxxxx openais[31517]: [TOTEM] The token was lost in the
OPERATIONAL state.
Oct 26 21:54:49 xxxxx openais[31517]: [TOTEM] Receive multicast socket recv
buffer size (288000 bytes).
Oct 26 21:54:49 xxxxx openais[31517]: [TOTEM] Transmit multicast socket send
buffer size (262142 bytes).
Oct 26 21:54:49 xxxxx openais[31517]: [TOTEM] entering GATHER state from 2.
Oct 26 21:54:50 xxxxx qdiskd[31565]: <notice> Writing eviction notice for
node 2
Oct 26 21:54:51 xxxxx qdiskd[31565]: <notice> Node 2 evicted
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] entering GATHER state from 0.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] Creating commit token because
I am the rep.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] Saving state aru 75 high seq
received 75
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] Storing new sequence id for
ring 14ac
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] entering COMMIT state.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] entering RECOVERY state.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] position [0] member
10.127.255.137:
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] previous ring seq 5288 rep
10.127.255.137
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] aru 75 high delivered 75
received flag 1
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] Did not need to originate any
messages in recovery.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] Sending initial ORF token
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] CLM CONFIGURATION CHANGE
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] New Configuration:
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] r(0) ip(10.127.255.137)
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] Members Left:
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] r(0) ip(10.127.255.138)
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] Members Joined:
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] CLM CONFIGURATION CHANGE
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] New Configuration:
Oct 26 21:54:54 xxxxx clurgmgrd[31715]: <info> State change: node2.ha DOWN
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] r(0) ip(10.127.255.137)
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] Members Left:
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] Members Joined:
Oct 26 21:54:54 xxxxx openais[31517]: [SYNC ] This node is within the
primary component and will provide service.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] entering OPERATIONAL state.
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] got nodejoin message
10.127.255.137
Oct 26 21:54:54 xxxxx openais[31517]: [CPG ] got joinlist message from node
1
Oct 26 21:54:54 xxxxx kernel: dlm: closing connection to node 2
Oct 26 21:54:54 xxxxx fenced[31533]: node2.ha not a cluster member after 0
sec post_fail_delay
Oct 26 21:54:54 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:54:54 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:54:59 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:54:59 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:55:04 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:55:04 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:55:09 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:55:09 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:55:14 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:55:14 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:55:19 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:55:19 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:55:24 xxxxx fenced[31533]: fencing node "node2.ha"
This repeats until I force it via fenced_override:
node1# echo node2.ha > /var/run/cluster/fenced_override
node1# tail -1 /var/log/messages
Oct 26 22:05:08 xxxxx clurgmgrd[31715]: <notice> Taking over service
vm:w2003-vm01 from down member node2.ha
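If it helps, I can also post the fence daemon's debug buffer; as far as I know it can be dumped like this (assuming group_tool from cman 2.0):

node1# group_tool ls
node1# group_tool dump fence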
Another example: if I simply bring the heartbeat interface down on node2 (to simulate a failure), the same thing happens.
node2# ifconfig eth1 down
node1# tail -50 /var/log/messages
Oct 26 23:39:07 xxxxx openais[31517]: [TOTEM] The token was lost in the
OPERATIONAL state.
Oct 26 23:39:07 xxxxx openais[31517]: [TOTEM] Receive multicast socket recv
buffer size (288000 bytes).
Oct 26 23:39:07 xxxxx openais[31517]: [TOTEM] Transmit multicast socket send
buffer size (262142 bytes).
Oct 26 23:39:07 xxxxx openais[31517]: [TOTEM] entering GATHER state from 2.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] entering GATHER state from 0.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] Creating commit token because
I am the rep.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] Saving state aru 52 high seq
received 52
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] Storing new sequence id for
ring 14b4
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] entering COMMIT state.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] entering RECOVERY state.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] position [0] member
10.127.255.137:
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] previous ring seq 5296 rep
10.127.255.137
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] aru 52 high delivered 52
received flag 1
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] Did not need to originate any
messages in recovery.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] Sending initial ORF token
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] CLM CONFIGURATION CHANGE
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] New Configuration:
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] r(0) ip(10.127.255.137)
Oct 26 23:39:12 xxxxx clurgmgrd[31715]: <info> State change: node2.ha DOWN
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] Members Left:
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] r(0) ip(10.127.255.138)
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] Members Joined:
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] CLM CONFIGURATION CHANGE
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] New Configuration:
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] r(0) ip(10.127.255.137)
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] Members Left:
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] Members Joined:
Oct 26 23:39:12 xxxxx openais[31517]: [SYNC ] This node is within the
primary component and will provide service.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] entering OPERATIONAL state.
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] got nodejoin message
10.127.255.137
Oct 26 23:39:12 xxxxx openais[31517]: [CPG ] got joinlist message from node
1
Oct 26 23:39:12 xxxxx kernel: dlm: closing connection to node 2
Oct 26 23:39:12 xxxxx fenced[31533]: node2.ha not a cluster member after 0
sec post_fail_delay
Oct 26 23:39:12 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 23:39:12 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 23:39:17 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 23:39:17 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 23:39:22 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 23:39:22 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 23:39:27 xxxxx fenced[31533]: fencing node "node2.ha"
node1# clustat
Cluster Status for alpha @ Sun Oct 26 23:41:20 2008
Member Status: Quorate

 Member Name                                       ID   Status
 ------ ----                                       ---- ------
 node1.ha                                             1 Online, Local, rgmanager
 node2.ha                                             2 Offline
 /dev/mapper/3600604800002877515624d4630383434p1      0 Online, Quorum Disk

 Service Name             Owner (Last)      State
 ------- ----             ----- ------      -----
 vm:rh52-para-virt01      node1.ha          started
 vm:w2003-vm01            node2.ha          started
I believe node1 did power node2 off via iLO, since node2 stopped responding, but node1 still didn't take over the service as it should.
Finally, to try to work around this, I loaded these modules (from the hp-OpenIPMI-8.1.0-104.rhel5 package) on both nodes, but nothing changed:
/opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen/ipmi_devintf.ko
/opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen/ipmi_msghandler.ko
/opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen/ipmi_poweroff.ko
/opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen/ipmi_si.ko
/opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen/ipmi_watchdog.ko
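Roughly like this, in case the order matters (ipmi_msghandler first, since the other modules seem to depend on it):

node1# cd /opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen
node1# insmod ipmi_msghandler.ko
node1# insmod ipmi_devintf.ko
node1# insmod ipmi_si.ko
node1# insmod ipmi_poweroff.ko
node1# insmod ipmi_watchdog.ko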
PS: I'm running RHEL 5.2 with kernel-2.6.18-92.el5, cman-2.0.84-2.el5, rgmanager-2.0.38-2.el5, and iLO firmware 1.50 on the HPs.
Thanks a lot.
--
Renan