[Linux-cluster] cluster latest cvs does not fence dead nodes automatically

"Fajar A. Nugraha" <fajar@xxxxxxxxxxxx> · Tue, 15 Feb 2005 13:52:21 +0700

Hi,

I'm building a two-node cluster using today's CVS checkout from 
sources.redhat.com:/cvs/cluster.

Shared storage is on an FC shared disk.

Everything works as expected, up to and including using GFS.

When I simulated a node crash (I did ifcfg eth0 down on node 2),
node 1 logged only this to syslog:

Feb 15 13:33:35 hosting-cl02-01 CMAN: removing node hosting-cl02-02 from the cluster : Missed too many heartbeats

However, NO fencing occurred. There was not even a "fence failed" message. 
I use fence_ibmblade.

After that, access to the GFS device blocks (df -k still works, though), 
and /proc/cluster/nodes shows:

Node  Votes Exp Sts  Name
  1    1    1   M   node-01
  2    1    1   X   node-02

There is an "X" against node 2, but /proc/cluster/service shows:

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2]

DLM Lock Space:  "data"                              3   4 run       -
[1 2]

DLM Lock Space:  "config"                            5   6 run       -
[1 2]

GFS Mount Group: "data"                              4   5 run       -
[1 2]

GFS Mount Group: "config"                            6   7 run       -
[1 2]

which is identical to the output from before node 2 died.
AFAIK, the state should be "recover" or "waiting to recover" instead of "run".
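
The mismatch between the two /proc files can be checked mechanically. Below is a rough Python sketch, not part of the cluster tools, that parses text in the two layouts pasted above and lists every service group that is still in "run" state while listing a node marked "X". The function names are made up for illustration, and the sketch assumes that "X" means the node is no longer a member and that the bracketed line under each service is its member list.

```python
import re

# Sample text copied from the /proc/cluster/nodes and
# /proc/cluster/service output shown in this post.
NODES_SAMPLE = """\
Node  Votes Exp Sts  Name
  1    1    1   M   node-01
  2    1    1   X   node-02
"""

SERVICES_SAMPLE = """\
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2]

GFS Mount Group: "data"                              4   5 run       -
[1 2]
"""

# Matches e.g.: Fence Domain:    "default"    1   2 run   -
SERVICE_RE = re.compile(
    r'^(?P<kind>[^"]+?):\s+"(?P<name>[^"]+)"\s+\d+\s+\d+\s+(?P<state>\S+)'
)


def parse_nodes(text):
    """Return {node_id: status_letter} from nodes-style text."""
    status = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 5 and fields[0].isdigit():
            status[int(fields[0])] = fields[3]
    return status


def parse_services(text):
    """Return a list of (kind, name, state, member_ids) tuples."""
    services = []
    for line in text.splitlines():
        match = SERVICE_RE.match(line)
        if match:
            services.append([match["kind"], match["name"], match["state"], []])
        elif services and line.startswith("["):
            # Member list such as "[1 2]" belongs to the previous service.
            services[-1][3] = [int(x) for x in line.strip("[] \t").split()]
    return [tuple(s) for s in services]


def stale_groups(nodes_text, services_text):
    """Groups in 'run' state that still list a node marked 'X' (assumed dead)."""
    dead = {nid for nid, sts in parse_nodes(nodes_text).items() if sts == "X"}
    return [
        (kind, name, sorted(dead & set(members)))
        for kind, name, state, members in parse_services(services_text)
        if state == "run" and dead & set(members)
    ]


if __name__ == "__main__":
    for kind, name, dead_ids in stale_groups(NODES_SAMPLE, SERVICES_SAMPLE):
        print(f'{kind} "{name}" still running with dead node(s) {dead_ids}')
```

On a live node the two sample strings would be replaced by the contents of /proc/cluster/nodes and /proc/cluster/service. With the samples above, every listed group is reported, which is the inconsistency described here.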

If I reboot node 2 (which has the same effect as executing 
fence_ibmblade manually) and restart the cluster services on that node, 
everything goes back to normal, and these messages appear in syslog:

Feb 15 13:38:40 node-01 CMAN: node node-02 rejoining

Feb 15 13:38:40 node-01 fenced[25486]: node-02 not a cluster member after 0 sec post_fail_delay

Feb 15 13:38:42 node-01 GFS: fsid=node:config.0: jid=1: Trying to acquire journal lock...

Feb 15 13:38:42 node-01 GFS: fsid=node:data.0: jid=1: Trying to acquire journal lock...

Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Looking at journal...

Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Looking at journal...

Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Acquiring the transaction lock...

Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Replaying journal...

Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Replayed 0 of 0 blocks

Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: replays = 0, skips = 0, sames = 0

Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Acquiring the transaction lock...

Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Replaying journal...

Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Replayed 0 of 0 blocks

Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: replays = 0, skips = 0, sames = 0

Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Journal replayed in 1s

Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Done

Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Journal replayed in 1s

Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Done
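
The timestamps make the delay easy to quantify. As a small illustrative sketch (the helper names are hypothetical, and the three log lines are copied from the syslog excerpts in this post), this measures the time between CMAN removing the node and the first message from fenced:

```python
from datetime import datetime

# Three lines copied from the syslog excerpts in this post.
LOG = [
    "Feb 15 13:33:35 hosting-cl02-01 CMAN: removing node hosting-cl02-02 "
    "from the cluster : Missed too many heartbeats",
    "Feb 15 13:38:40 node-01 CMAN: node node-02 rejoining",
    "Feb 15 13:38:40 node-01 fenced[25486]: node-02 not a cluster member "
    "after 0 sec post_fail_delay",
]


def stamp(line):
    """Parse the leading syslog timestamp, e.g. 'Feb 15 13:33:35'.

    Syslog lines carry no year, so only differences are meaningful.
    """
    return datetime.strptime(line[:15], "%b %d %H:%M:%S")


def first_matching(lines, needle):
    """Return the first line containing needle, or None if there is none."""
    return next((line for line in lines if needle in line), None)


removed = first_matching(LOG, "CMAN: removing node")
fenced = first_matching(LOG, " fenced[")
gap_seconds = (stamp(fenced) - stamp(removed)).total_seconds()

print(f"fenced first logged {gap_seconds:.0f}s after the node was removed")
```

For these lines the gap is 305 seconds, and the fenced message carries the same timestamp as the "rejoining" line, so in this log fenced said nothing until node 2 had already been rebooted by hand.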

Any idea what's wrong?

Regards,

Fajar