[Linux-cluster] cluster latest cvs does not fence dead nodes automatically

Hi,

I'm building a two-node cluster using today's CVS from sources.redhat.com:/cvs/cluster.
Shared storage is on an FC shared disk.
Everything works as expected, up to the point of using GFS.


When I simulated a node crash (I did ifcfg eth0 down on node 2),
node 1 simply logged this (in syslog):

Feb 15 13:33:35 hosting-cl02-01 CMAN: removing node hosting-cl02-02 from the cluster : Missed too many heartbeats

However, NO fencing occurred. Not even a "fence failed" message. I use fence_ibmblade.
After that, access to the GFS device blocks (df -k still works though), and /proc/cluster/nodes shows


Node  Votes Exp Sts  Name
  1    1    1   M   node-01
  2    1    1   X   node-02

there's an "X" on node 2, but /proc/cluster/service shows

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2]

DLM Lock Space:  "data"                              3   4 run       -
[1 2]

DLM Lock Space:  "config"                            5   6 run       -
[1 2]

GFS Mount Group: "data"                              4   5 run       -
[1 2]

GFS Mount Group: "config"                            6   7 run       -
[1 2]

which is the same content as before node 2 died.
AFAIK, the state should be "recover" or "waiting to recover" instead of "run".
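
For anyone who wants to check for this mismatch mechanically, here is a minimal Python sketch. It assumes the two /proc files have exactly the layout pasted above and works on their text as plain strings. The function names (dead_node_ids, stale_services) are made up for illustration and are not part of any cluster tool.

```python
import re

def dead_node_ids(nodes_text):
    """Return the IDs of nodes whose Sts column is 'X' (assumed layout: id votes exp sts name)."""
    dead = set()
    for line in nodes_text.splitlines():
        fields = line.split()
        # Skip the header row and blank lines; data rows start with a numeric node ID.
        if len(fields) >= 5 and fields[0].isdigit() and fields[3] == "X":
            dead.add(int(fields[0]))
    return dead

def stale_services(services_text, dead):
    """Return (type, name, state) for groups still in 'run' that list a dead node as a member."""
    # Matches e.g.:  Fence Domain:    "default"    1   2 run       -
    header = re.compile(r'^([^:]+):\s+"([^"]+)"\s+\d+\s+\d+\s+(\S+)')
    # Matches the member list on the following line, e.g.:  [1 2]
    members = re.compile(r'^\[([\d\s]+)\]$')
    stale = []
    current = None
    for raw in services_text.splitlines():
        line = raw.strip()
        m = header.match(line)
        if m:
            current = (m.group(1), m.group(2), m.group(3))
            continue
        m = members.match(line)
        if m and current:
            ids = {int(x) for x in m.group(1).split()}
            if current[2] == "run" and ids & dead:
                stale.append(current)
            current = None
    return stale

if __name__ == "__main__":
    # Sample text copied from the output above (shortened to two service groups).
    nodes = """Node  Votes Exp Sts  Name
  1    1    1   M   node-01
  2    1    1   X   node-02
"""
    services = """Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

GFS Mount Group: "data"                              4   5 run       -
[1 2]
"""
    for svc_type, name, state in stale_services(services, dead_node_ids(nodes)):
        print(f'{svc_type} "{name}" is still in state {state} but lists a dead node')
```

On a live node you would feed it the contents of the two /proc files instead of the sample strings. Any group it reports is one where recovery has apparently not started.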

If I reboot node 2 (which is what happens if you execute
fence_ibmblade manually) and restart cluster services on that node, everything goes back to normal,
and these messages show up in syslog:


Feb 15 13:38:40 node-01 CMAN: node node-02 rejoining
Feb 15 13:38:40 node-01 fenced[25486]: node-02 not a cluster member after 0 sec post_fail_delay
Feb 15 13:38:42 node-01 GFS: fsid=node:config.0: jid=1: Trying to acquire journal lock...
Feb 15 13:38:42 node-01 GFS: fsid=node:data.0: jid=1: Trying to acquire journal lock...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Looking at journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Looking at journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Acquiring the transaction lock...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Replaying journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Replayed 0 of 0 blocks
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: replays = 0, skips = 0, sames = 0
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Acquiring the transaction lock...
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Replaying journal...
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Replayed 0 of 0 blocks
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: replays = 0, skips = 0, sames = 0
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Journal replayed in 1s
Feb 15 13:38:43 node-01 GFS: fsid=node:data.0: jid=1: Done
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Journal replayed in 1s
Feb 15 13:38:43 node-01 GFS: fsid=node:config.0: jid=1: Done


Any idea what's wrong?

Regards,

Fajar




