Hi,
I'm building a two-node cluster using today's CVS checkout from sources.redhat.com:/cvs/cluster.
Shared storage is on an FC shared disk.
Everything works as expected up to the point of using GFS.
When I simulated a node crash (I ran ifcfg eth0 down on node 2), node 1 simply logged this in syslog:
Feb 15 13:33:35 hosting-cl02-01 CMAN: removing node hosting-cl02-02 from the cluster : Missed too many heartbeats
However, NO fencing occurred, and there was not even a "fence failed" message. I use fence_ibmblade.
After that, access to the GFS device blocked (df -k still works, though), and /proc/cluster/nodes shows
Node  Votes Exp Sts  Name
   1    1    1   M   node-01
   2    1    1   X   node-02
There's an "X" on node 2, but /proc/cluster/service shows
Service            Name        GID LID State Code
Fence Domain:      "default"     1   2 run   -    [1 2]
DLM Lock Space:    "clvmd"       2   3 run   -    [1 2]
DLM Lock Space:    "data"        3   4 run   -    [1 2]
DLM Lock Space:    "config"      5   6 run   -    [1 2]
GFS Mount Group:   "data"        4   5 run   -    [1 2]
GFS Mount Group:   "config"      6   7 run   -    [1 2]
which is the same content as before node 2 died. AFAIK, the state should be "recover" or "waiting to recover" instead of "run".
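The mismatch described above can be checked mechanically. Below is a minimal Python sketch, assuming both /proc files have exactly the layout pasted in this message (one row per line, member IDs in brackets at the end of each service row). The function names and the sample strings are made up for illustration and are not part of any cluster tool; on a real node the layout may differ, so the parsing would need adjusting.

```python
import re

# Sample text copied from the output above (assumed layout, one row per line).
NODES_TEXT = """Node Votes Exp Sts Name
1 1 1 M node-01
2 1 1 X node-02
"""

SERVICE_TEXT = """Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [1 2]
DLM Lock Space: "clvmd" 2 3 run - [1 2]
DLM Lock Space: "data" 3 4 run - [1 2]
DLM Lock Space: "config" 5 6 run - [1 2]
GFS Mount Group: "data" 4 5 run - [1 2]
GFS Mount Group: "config" 6 7 run - [1 2]
"""

# <kind>: "<name>" <gid> <lid> <state> <code> [<member ids>]
SERVICE_ROW = re.compile(
    r'^(.*?):\s+"([^"]+)"\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+\[([\d\s]*)\]'
)


def dead_node_ids(nodes_text):
    """Return the IDs of nodes whose Sts column is "X"."""
    dead = set()
    for line in nodes_text.splitlines():
        fields = line.split()
        # Data rows: <id> <votes> <exp> <sts> <name>; the header row is skipped
        # because its first field is not a number.
        if len(fields) == 5 and fields[0].isdigit() and fields[3] == "X":
            dead.add(int(fields[0]))
    return dead


def stale_services(nodes_text, service_text):
    """Return (kind, name) for services in state "run" that still list a dead node."""
    dead = dead_node_ids(nodes_text)
    stale = []
    for line in service_text.splitlines():
        match = SERVICE_ROW.match(line.strip())
        if not match:
            continue  # header or unrecognised line
        kind, name, state = match.group(1), match.group(2), match.group(5)
        members = {int(n) for n in match.group(7).split()}
        if state == "run" and members & dead:
            stale.append((kind, name))
    return stale


if __name__ == "__main__":
    for kind, name in stale_services(NODES_TEXT, SERVICE_TEXT):
        print(f'{kind} "{name}" is still "run" but lists a dead node')
```

With the sample text above, every service row is reported, which matches the symptom: node 2 is marked "X" but no service has left the "run" state.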
If I reboot node 2 (which is the same thing as executing
fence_ibmblade manually) and restart the cluster services on that node, everything returns to normal,
and these messages appear in syslog:
Feb 15 13:38:40 node-01 CMAN: node node-02 rejoining
Feb 15 13:38:40 node-01 fenced[25486]: node-02 not a cluster member after 0 sec post_fail_delay
Feb 15 13:38:42 node-01 GFS: fsid=node:config.0: jid=1: Trying to acquire journal lock...
Feb 15 13:38:42 node-01 GFS: fsid=node:data.0: jid=1: Trying to acquire journal lock...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Looking at journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Looking at journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Acquiring the transaction lock...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Replaying journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Replayed 0 of 0 blocks
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: replays = 0, skips = 0, sames = 0
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Acquiring the transaction lock...
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Replaying journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Replayed 0 of 0 blocks
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: replays = 0, skips = 0, sames = 0
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Journal replayed in 1s
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Done
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Journal replayed in 1s
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Done
Any idea what's wrong?
Regards,
Fajar