Thomas Börnert wrote:
Hi List,
2 servers, connected with a crossover cable
my rpms:
gfs2-utils-0.1.38-1.el5
gfs-utils-0.1.12-1.el5
kmod-gfs2-1.52-1.16.el5
cman-2.0.73-1.el5_1.1
my cluster.conf on both nodes:
---------------------------------------------------------------------------------
<?xml version="1.0"?>
<cluster name="cluster" config_version="2">
<cman two_node="1" expected_votes="1">
</cman>
<clusternodes>
<clusternode name="node1" votes="1" nodeid="1">
<fence>
<method name="human">
<device name="human" nodename="node1"/>
</method>
</fence>
</clusternode>
<clusternode name="node2" votes="1" nodeid="2">
<fence>
<method name="human">
<device name="human" nodename="node2"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="human" agent="fence_manual"/>
</fencedevices>
</cluster>
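(One small inconsistency worth flagging: the file above says
config_version="2", while cman_tool status below reports Config Version: 3,
so a newer revision was presumably activated at some point. Whenever
cluster.conf is edited, config_version must be bumped and the change
propagated. A minimal sketch with the stock RHEL5 tools, run on one node;
the version number 3 is just the example here:

ccs_tool update /etc/cluster/cluster.conf   # push the edited file to all nodes
cman_tool version -r 3                      # tell cman to activate the new version
)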
---------------------------------------------------------------------------------------
my /etc/hosts on both nodes:
192.168.0.1 node1
192.168.0.2 node2
my mkfs and mount commands:
mkfs.gfs2 -p lock_dlm -t cluster:drbd -j 2 /dev/drbd0
mount -t gfs2 -o noatime,nodiratime /dev/drbd0 /test
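(A note on the -t argument: the format is <clustername>:<fsname>, and the
first half must match the <cluster name="..."> in cluster.conf, here
"cluster", or GFS2 will refuse to mount. To verify what is stored in the
superblock, assuming stock gfs2-utils:

gfs2_tool sb /dev/drbd0 table   # prints the lock table name (cluster:drbd)
gfs2_tool sb /dev/drbd0 proto   # prints the locking protocol (lock_dlm)
)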
(Btw: DRBD works fine as Primary/Primary.)
OK, I can use /test on both nodes and write to files, and so on.
cman_tool nodes
--------------------------------------------------------------------------------------
Node Sts Inc Joined Name
1 M 364 2008-02-26 23:20:16 node1
2 M 360 2008-02-26 23:20:16 node2
cman_tool status
-------------------------------------------------------------------------------------
Version: 6.0.1
Config Version: 3
Cluster Name: cluster
Cluster Id: 34996
Cluster Member: Yes
Cluster Generation: 364
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1
Active subsystems: 6
Flags: 2node
Ports Bound: 0
Node name: node2
Node ID: 2
Multicast addresses: 239.192.136.61
Node addresses: 192.168.0.2
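(For reference: two_node="1" together with expected_votes="1" is cman's
special two-node mode, in which either node alone remains quorate. That is
why, after node1 dies, the second status output further down still shows
Quorum: 1; the hang that follows is a fencing problem, not a quorum
problem. A quick filtered check, assuming plain cman_tool output:

cman_tool status | egrep 'Expected votes|Total votes|Quorum|Flags'
)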
NOW: I power node1 off!
my log on node2 shows:
-----------------------------------------------------------------------------------------
==> /var/log/messages <==
Feb 26 23:27:22 node2 last message repeated 13 times
==> /var/log/kernel <==
Feb 26 23:27:31 node2 kernel: tg3: eth1: Link is down.
Feb 26 23:27:32 node2 kernel: tg3: eth1: Link is up at 100 Mbps, full duplex.
Feb 26 23:27:32 node2 kernel: tg3: eth1: Flow control is off for TX and off
for RX.
Feb 26 23:27:36 node2 kernel: drbd0: PingAck did not arrive in time.
Feb 26 23:27:36 node2 kernel: drbd0: peer( Primary -> Unknown ) conn(
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Feb 26 23:27:36 node2 kernel: drbd0: Creating new current UUID
Feb 26 23:27:36 node2 kernel: drbd0: asender terminated
Feb 26 23:27:36 node2 kernel: drbd0: short read expecting header on sock:
r=-512
Feb 26 23:27:36 node2 kernel: drbd0: tl_clear()
Feb 26 23:27:36 node2 kernel: drbd0: Connection closed
Feb 26 23:27:36 node2 kernel: drbd0: Writing meta data super block now.
Feb 26 23:27:36 node2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Feb 26 23:27:36 node2 kernel: drbd0: receiver terminated
Feb 26 23:27:36 node2 kernel: drbd0: receiver (re)started
Feb 26 23:27:36 node2 kernel: drbd0: conn( Unconnected -> WFConnection )
==> /var/log/messages <==
Feb 26 23:27:37 node2 last message repeated 3 times
Feb 26 23:27:40 node2 openais[3288]: [TOTEM] The token was lost in the
OPERATIONAL state.
Feb 26 23:27:40 node2 openais[3288]: [TOTEM] Receive multicast socket recv
buffer size (288000 bytes).
Feb 26 23:27:40 node2 openais[3288]: [TOTEM] Transmit multicast socket send
buffer size (262142 bytes).
Feb 26 23:27:40 node2 openais[3288]: [TOTEM] entering GATHER state from 2.
Feb 26 23:27:42 node2 root: Process did not exit cleanly, returned 2 with
signal 0
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] entering GATHER state from 0.
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] Creating commit token because I
am the rep.
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] Saving state aru 31 high seq
received 31
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] Storing new sequence id for ring
170
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] entering COMMIT state.
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] entering RECOVERY state.
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] position [0] member 192.168.0.2:
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] previous ring seq 364 rep
192.168.0.1
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] aru 31 high delivered 31 received
flag 1
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] Did not need to originate any
messages in recovery.
Feb 26 23:27:44 node2 openais[3288]: [TOTEM] Sending initial ORF token
Feb 26 23:27:44 node2 openais[3288]: [CLM ] CLM CONFIGURATION CHANGE
Feb 26 23:27:44 node2 openais[3288]: [CLM ] New Configuration:
Feb 26 23:27:44 node2 fenced[3307]: node1 not a cluster member after 0 sec
post_fail_delay
Feb 26 23:27:44 node2 openais[3288]: [CLM ] r(0) ip(192.168.0.2)
Feb 26 23:27:44 node2 fenced[3307]: fencing node "node1"
==> /var/log/kernel <==
Feb 26 23:27:44 node2 kernel: dlm: closing connection to node 1
==> /var/log/messages <==
Feb 26 23:27:44 node2 openais[3288]: [CLM ] Members Left:
Feb 26 23:27:45 node2 openais[3288]: [CLM ] r(0) ip(192.168.0.1)
Feb 26 23:27:45 node2 fence_manual: Node node1 needs to be reset before
recovery can procede. Waiting for node1 to rejoin the cluster or for manual
acknowledgement that it has been reset (i.e. fence_ack_manual -n node1)
Note this message...
Feb 26 23:27:45 node2 openais[3288]: [CLM ] Members Joined:
Feb 26 23:27:45 node2 openais[3288]: [CLM ] CLM CONFIGURATION CHANGE
Feb 26 23:27:45 node2 openais[3288]: [CLM ] New Configuration:
Feb 26 23:27:45 node2 openais[3288]: [CLM ] r(0) ip(192.168.0.2)
Feb 26 23:27:45 node2 openais[3288]: [CLM ] Members Left:
Feb 26 23:27:45 node2 openais[3288]: [CLM ] Members Joined:
Feb 26 23:27:45 node2 openais[3288]: [SYNC ] This node is within the primary
component and will provide service.
Feb 26 23:27:45 node2 openais[3288]: [TOTEM] entering OPERATIONAL state.
Feb 26 23:27:45 node2 openais[3288]: [CLM ] got nodejoin message 192.168.0.2
Feb 26 23:27:45 node2 openais[3288]: [CPG ] got joinlist message from node 2
Feb 26 23:27:47 node2 root: Process did not exit cleanly, returned 2 with
signal 0
-------------------------------------------------------------------------------------------------------------
ls /test works
BUT
touch /test/testfile hangs ....
cman_tool nodes shows
------------------------------------------------------------------------------------------------------------------
Node Sts Inc Joined Name
1 X 364 node1
2 M 360 2008-02-26 23:20:16 node2
-----------------------------------------------------------------------------------------------------------------
cman_tool status shows
-----------------------------------------------------------------------------------------------------------------
Version: 6.0.1
Config Version: 3
Cluster Name: cluster
Cluster Id: 34996
Cluster Member: Yes
Cluster Generation: 368
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1
Active subsystems: 6
Flags: 2node
Ports Bound: 0
Node name: node2
Node ID: 2
Multicast addresses: 239.192.136.61
Node addresses: 192.168.0.2
------------------------------------------------------------------------------------------------------------------
My DRBD has no problem; its state is still Primary (StandAlone).
Why can't I write to a GFS partition in this "lost node" state?
Now: I power node1 on!
DRBD is no problem -> it recovers.
Now I start cman,
and the hanging touch completes ....
Thanks for any ideas and help
-Thomas
This is because you are using manual fencing. Fencing is required to
ensure that an errant node does not continue to write to the shared
filesystem after it has lost communication with the cluster, thereby
corrupting the data. The only way to do this is to halt all cluster
activity (including granting GFS locks) until the fencing succeeds.
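On the surviving node you can see that blocked state directly. A minimal
diagnostic sketch, assuming the stock RHEL5 cman/fence tools (exact output
varies):

group_tool ls          # the fence domain should show a wait/failed state
cman_tool services     # the same service groups in the older output format
group_tool dump fence  # fenced's debug buffer, including the pending fence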
The "manual" means that an administrator must intervene and correct the
problem before cluster operations can resume. So when you power off
node1, node2 detects missed heartbeats and fences node1. Now you must
manually fence node1 by powering it off (this is already done in your
case) then do one of the following:
1) Run the following command to acknowledge that you have manually
fenced the node:
# /sbin/fence_ack_manual -n node1
OR
2) Start node1 back up and have it rejoin the cluster (a sketch follows
below)
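For option 2, a sketch of the rejoin sequence on node1 after power-on,
assuming the stock RHEL5 init scripts and the DRBD/GFS2 setup above
(<resource> is a placeholder for your DRBD resource name):

service drbd start            # reconnect and resync with node2
drbdadm primary <resource>    # unless DRBD auto-promotes on startup
service cman start            # rejoin the cluster; this clears the pending fence
mount -t gfs2 -o noatime,nodiratime /dev/drbd0 /test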
The danger with manual fencing comes in when you run fence_ack_manual
quickly, without properly investigating the issue or actually fencing the
node. The "fenced" node may in fact still be up, and you run the command
without noticing that only the network connection between the nodes has
been lost. Both nodes then proceed to write to GFS without being able to
communicate, and they quickly corrupt the data.
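A cautious pattern before acknowledging, as a sketch (a ping over the
cluster link proves nothing on its own, so always verify out of band as
well):

ping -c 3 -W 2 node1    # unreachable on the interconnect? necessary, not sufficient
# verify out of band (console, remote power switch, or physically) that
# node1 is really powered off, not merely unplugged from the crossover cable
fence_ack_manual -n node1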
So, when using manual fencing, always be careful before running
fence_ack_manual.
John