Unfortunately, you need a shared disk to run qdisk; it cannot work in a
"diskless" mode right now.
ext Brett Cave wrote:
On Wed, Feb 25, 2009 at 11:45 AM, Mockey Chen <mockey.chen@xxxxxxx> wrote:
ext Kein He wrote:
I think there is a problem; "cman_tool status" shows:
Nodes: 2
Expected votes: 3
Total votes: 2
According to your cluster.conf, if all nodes and the qdisk are online,
the "Total votes" should be 3. Probably qdiskd is not running; you can
use "cman_tool nodes" to check whether the qdisk is registered.
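(For reference, cman computes the quorum threshold as expected_votes / 2 + 1
using integer division, so with your settings:

quorum = 3 / 2 + 1 = 2
both nodes up, qdiskd down:  total votes = 2, still quorate but with no margin
one node lost, no qdisk:     total votes = 1 < 2, quorum is dissolved

which matches the "Quorum: 2" line that cman_tool status reports and explains
why a single surviving node blocks.)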
Yes, here is "cman_tool nodes" output:
Node Sts Inc Joined Name
1 M 112 2009-02-25 03:05:19 as-1.localdomain
2 M 104 2009-02-25 03:05:19 as-2.localdomain
A question: how do I check whether qdisk is running, and how do I run it?
[root@blade3 ~]# service qdiskd status
qdiskd (pid 2832) is running...
[root@blade3 ~]# pgrep qdisk -l
2832 qdiskd
[root@blade3 ~]# cman_tool nodes
Node Sts Inc Joined Name
0 M 0 2009-02-19 16:11:55 /dev/sda5 ## This is qdisk.
1 M 1524 2009-02-20 22:27:32 blade1
2 M 1552 2009-02-24 04:39:24 blade2
3 M 1500 2009-02-19 16:11:03 blade3
4 M 1516 2009-02-19 16:11:22 blade4
You can use "service qdiskd start" to start it, or run it with
/usr/sbin/qdiskd -Q if you don't have the init script. If you installed
from an RPM on a Red Hat-type distro, the script should be there.
Regards,
brett
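If the init script only reports [FAILED], running the daemon in the
foreground with debugging usually shows why. Flag names as I recall them
from RHEL 5's qdiskd(8); check the man page on your build:

qdiskd -f -d   # stay in the foreground and print debug output; stop with Ctrl+C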
I tried "service qdiskd start", but it failed:
[root@as-2 ~]# service qdiskd start
Starting the Quorum Disk Daemon: [FAILED]
[root@as-2 ~]# tail /var/log/messages
...
Feb 26 09:19:40 as-2 qdiskd[14707]: <crit> Unable to match label
'testing' to any device
Feb 26 09:19:46 as-2 clurgmgrd[4032]: <notice> Reconfiguring
Here is my qdisk configuration; I copied it from "man qdisk":
<quorumd interval="1" tko="10" votes="1" label="testing">
<heuristic program="ping 10.56.150.1 -c1 -t1" score="1"
interval="2" tko="3"/>
</quorumd>
How do I map the label to a device? Note: I do not have any shared storage.
Thanks.
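For reference, the label in <quorumd label="..."/> has to match a label that
mkqdisk writes onto a shared block device; qdiskd scans block devices looking
for it, so with no shared storage there is nothing for it to find. If you did
have a small shared partition visible to both nodes (here /dev/sdX is only a
placeholder), the mapping would be set up roughly like this:

mkqdisk -c /dev/sdX -l testing   # write the label (destroys existing data on the partition)
mkqdisk -L                       # list the quorum-disk labels this node can see

Without shared storage, the usual fallback for a two-node cluster is cman's
two-node mode instead of qdisk. A minimal sketch, assuming you drop the
<quorumd> block entirely and rely on fencing alone to resolve split-brain:

<cman two_node="1" expected_votes="1"/>

With two_node="1" either node is quorate by itself, so working power fencing
(the iLO devices already in your config) is what prevents both nodes from
running the services at once.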
Mockey Chen wrote:
ext Mockey Chen wrote:
ext Kein He wrote:
Hi Mockey,
Could you please attach the output from "cman_tool status" and
"cman_tool nodes -f"?
Thanks for your response.
I tried to run cman_tool status on as-2, but it hung with no output, and
even Ctrl+C had no effect.
I manually rebooted as-1, and the problem was solved.
Here is the output of cman_tool:
[root@as-1 ~]# cman_tool status
Version: 6.1.0
Config Version: 19
Cluster Name: azerothcluster
Cluster Id: 20148
Cluster Member: Yes
Cluster Generation: 76
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
Active subsystems: 8
Flags: Dirty
Ports Bound: 0 177
Node name: as-1.localdomain
Node ID: 1
Multicast addresses: 239.192.78.3
Node addresses: 10.56.150.3
[root@as-1 ~]# cman_tool status -f
Version: 6.1.0
Config Version: 19
Cluster Name: azerothcluster
Cluster Id: 20148
Cluster Member: Yes
Cluster Generation: 76
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
Active subsystems: 8
Flags: Dirty
Ports Bound: 0 177
Node name: as-1.localdomain
Node ID: 1
Multicast addresses: 239.192.78.3
Node addresses: 10.56.150.3
It seems the cluster cannot fence one of the nodes. How do I solve this?
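One quick sanity check, independent of the cluster stack, is to confirm the
fence agent can reach the iLO processors at all. A sketch reusing the
hostname and credentials from the cluster.conf quoted below (adjust as
needed):

fence_ilo -a 10.56.154.18 -l power -p pass -o status

If that cannot report power status, the cluster's own fence attempts will
not succeed either.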
I opened a new window and could ssh to as-2, but after login I could not do
anything; even a simple 'ls' command hung.
It seems the system stays alive but does not provide any service. Really bad.
Is there any way to debug this issue?
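A few commands that usually help narrow down where things are stuck
(assuming the stock cman/fenced userland; run them on whichever node still
responds):

cman_tool status     # quorum state and vote counts
cman_tool services   # state of the fence, DLM and GFS service groups
group_tool ls        # the same groups as seen by groupd
fence_tool ls        # whether fenced is still waiting on a fence operation

If the cluster is inquorate, cluster-backed I/O (GFS, rgmanager, clustered
LVM) blocks by design, which would explain why even a simple 'ls' hangs if
it touches a clustered filesystem.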
Mockey Chen wrote:
Hi,
I have a two-node cluster. To avoid split-brain, I use iLO as the fence
device and an IP tiebreaker (the qdisk heuristic). Here is my
/etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster alias="azerothcluster" config_version="19"
name="azerothcluster">
<cman expected_votes="3" two_node="0"/>
<clusternodes>
<clusternode name="as-1.localdomain" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="ilo1"/>
</method>
</fence>
</clusternode>
<clusternode name="as-2.localdomain" nodeid="2" votes="1">
<fence>
<method name="1">
<device name="ilo2"/>
</method>
</fence>
</clusternode>
</clusternodes>
<quorumd interval="1" tko="10" votes="1" label="pingtest">
<heuristic program="ping 10.56.150.1 -c1 -t1"
score="1"
interval="2" tko="3"/>
</quorumd>
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
<fencedevices>
<fencedevice agent="fence_ilo" hostname="10.56.154.18"
login="power" name="ilo1" passwd="pass"/>
<fencedevice agent="fence_ilo" hostname="10.56.154.19"
login="power" name="ilo2" passwd="pass"/>
</fencedevices>
...
...
To test the case where one node loses heartbeat, I disabled the Ethernet
card (eth0) on as-1, expecting as-2 to take over the services from as-1 and
as-1 to reboot.
What actually happened is that as-1 lost its connection to as-2; as-2
detected this and tried to re-form the cluster, but failed. Here is the
syslog from as-2:
Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost in the
OPERATIONAL state.
Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast socket
recv buffer size (288000 bytes).
Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast socket
send buffer size (262142 bytes).
Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state
from 2.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state
from 0.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token
because I am the rep.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru 1f4 high
seq received 1f4
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence
id for
ring 2c
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT state.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY state.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member
10.56.150.4:
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq 40 rep
10.56.150.3
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high delivered
1f4
received flag 1
Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Did not need to originate any
messages in recovery.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial ORF token
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] CLM CONFIGURATION CHANGE
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] New Configuration:
Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] r(0) ip(10.56.150.4)
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] Members Left:
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] r(0) ip(10.56.150.3)
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] Members Joined:
Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking
activity
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] CLM CONFIGURATION CHANGE
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] New Configuration:
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] r(0) ip(10.56.150.4)
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] Members Left:
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] Members Joined:
Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is within the
primary component and will provide service.
Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate. Refusing
connection.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL
state.
Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect:
Connection refused
Feb 24 21:25:40 as-2 openais[4139]: [CLM ] got nodejoin message
10.56.150.4
Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
Feb 24 21:25:40 as-2 openais[4139]: [CPG ] got joinlist message from
node 2
Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting something
evil.
Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get: Invalid
request descriptor
Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
evil.
Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get: Invalid
request descriptor
Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified (-21).
Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
evil.
Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing disconnect:
Invalid request descriptor
Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address
record for
10.56.150.144 on eth0.
Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt (IP_ADD_MEMBERSHIP):
Address already in use
Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse
I also found some errors in as-1's syslog:
Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed changing RG
status
Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for eth0: Not
detected
Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on eth0...
...
Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster
infrastructure after 30 seconds.
...
Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
infrastructure after 60 seconds.
...
Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
infrastructure after 90 seconds.
Any comments are appreciated!
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster