Hello,
I want to test the fencing capability of GFS by unplugging the network on a
node, but I run into a problem when the node I unplug is the gulm master.
I am using the following RPMs:
- GFS-6.0.2.20-2
- GFS-modules-smp-6.0.2.20-2
I have an 8-node cluster (sam21, sam22, ..., sam28) and I mount a GFS
filesystem on all nodes at /mnt/gfs.
My config is:
----->8-------->8-------->8-------->8---
# fence.ccs
fence_devices {
    admin {
        agent = "fence_manual"
    }
}

# cluster.ccs
cluster {
    name = "sam"
    lock_gulm {
        servers = ["sam21", "sam22", "sam23", "sam24", "sam25"]
    }
}

# nodes.ccs
nodes {
    sam21.toulouse {
        ip_interfaces {
            eth0 = "192.168.0.121"
        }
        fence {
            human {
                admin {
                    ipaddr = "192.168.0.121"
                }
            }
        }
    }
    # etc. for sam22 ... sam28
}
----->8-------->8-------->8-------->8---
I want to check that the unplugged node is fenced and that its locks are
released when I run "fence_ack_manual" (and only when I run
"fence_ack_manual", not before).
In order to know when the locks are released, I wrote a small program:
----->8-------->8-------->8-------->8---
// lock.c
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>

int main(void)
{
    char buf[64];
    int fd, error, p, counter = 0;
    pid_t pid = getpid();

    fd = open("/mnt/gfs/lock-test.tmp", O_RDWR|O_CREAT, S_IREAD|S_IWRITE);
    if (fd == -1) {
        printf("ERROR: open failed.\n");
        return 1;
    }

    /* blocks until the exclusive lock is granted */
    error = flock(fd, LOCK_EX);
    if (error == -1) {
        printf("ERROR: lock failed.\n");
        return 1;
    }

    /* once the lock is held, write one line per second */
    while (1) {
        printf("writing... pid %d : %d\n", pid, counter++);
        buf[0] = 0;
        p = sprintf(buf, "pid %d : %d\n", pid, counter);
        write(fd, buf, p);
        sleep(1);
    }
}
----->8-------->8-------->8-------->8---
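(It compiles with a plain "gcc -o lock lock.c". Because flock(LOCK_EX)
blocks, the waiting copies stay silent until the holder's lock goes away, so
the moment another node starts printing "writing..." is the moment the lock
was actually handed over.)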
First test (which works):
- I run my lock.c program on sam26 (not a gulm server)
- The lock is acquired on sam26
- I run my lock.c program on all other nodes
- The other nodes wait for the lock
- I unplug sam26 and wait until the gulm master (sam21) wants to fence sam26
- The gulm master (sam21) asks me to run fence_ack_manual
- The lock is not taken by another node
- I run fence_ack_manual
- The lock held by the unplugged node (sam26) is released and taken by
another node
=> So when I unplug a node which is not a gulm server, everything works
correctly.
Second test (which doesn't work):
- I run my lock.c program on sam21 (the gulm master)
- The lock is acquired on sam21
- I run my lock.c program on all other nodes
- The other nodes wait for the lock
- I unplug sam21 and wait until a new gulm master (sam22) wants to fence the
old master (sam21)
- The new gulm master (sam22) asks me to run fence_ack_manual, BUT the lock
is released immediately: I have not run fence_ack_manual and the lock is
already released. This is my problem (see the timestamping sketch after this
list).
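To pin down exactly when a waiting node is granted the lock, the waiter side
could also timestamp the moment flock() returns. A minimal sketch (same test
file as above, illustration only) would be:
----->8-------->8-------->8-------->8---
// waiter.c (sketch): same flock() call as lock.c, but it records when
// the exclusive lock is granted, to compare against the fence_manual /
// fence_ack_manual timestamps in syslog.
#include <stdio.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("/mnt/gfs/lock-test.tmp", O_RDWR|O_CREAT, S_IREAD|S_IWRITE);
    if (fd == -1) {
        printf("ERROR: open failed.\n");
        return 1;
    }
    time_t t0 = time(NULL);
    if (flock(fd, LOCK_EX) == -1) {  /* blocks until the holder's lock is dropped */
        printf("ERROR: lock failed.\n");
        return 1;
    }
    time_t t1 = time(NULL);
    printf("lock granted after %ld s, at %s", (long)(t1 - t0), ctime(&t1));
    close(fd);
    return 0;
}
----->8-------->8-------->8-------->8---
If that timestamp comes before I have run fence_ack_manual (as it does in the
second test), the lock really was released before the fence was acknowledged.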
I read the bug reports [1][2] and the advisory RHBA-2005:466-11 [3], which
says "Fixed a problem in which a gulm lock server ran on GFS clients after
the master server died." But I am already using GFS-6.0.2.20-2.
[1] https://bugzilla.redhat.com/beta/show_bug.cgi?id=148029
[2] https://bugzilla.redhat.com/beta/show_bug.cgi?id=149119
[3] http://rhn.redhat.com/errata/RHBA-2005-466.html
Is this a bug? Or a misunderstanding of the fencing mechanism?
The syslog on the new gulm master (sam22) shows:
----->8-------->8-------->8-------->8---
Jul 1 09:06:52 sam22 lock_gulmd_core[4195]: Failed to receive a timely
heartbeat reply from Master. (t:1120201612489192 mb:1)
Jul 1 09:07:07 sam22 lock_gulmd_core[4195]: Failed to receive a timely
heartbeat reply from Master. (t:1120201627509192 mb:2)
Jul 1 09:07:22 sam22 lock_gulmd_core[4195]: Failed to receive a timely
heartbeat reply from Master. (t:1120201642529191 mb:3)
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: I see no Masters, So I am
Arbitrating until enough Slaves talk to me.
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to
slave sam23.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to
slave sam28.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to
slave sam26.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to
slave sam25.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to
slave sam24.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to
slave sam22.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to
slave sam27.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: LastMaster
sam21.toulouse:192.168.0.121, is being marked Expired.
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership
update "Expired" about sam21.toulouse to slave sam23.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership
update "Expired" about sam21.toulouse to slave sam28.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership
update "Expired" about sam21.toulouse to slave sam26.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_LTPX[4197]: New Master at
sam22.toulouse:192.168.0.122
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership
update "Expired" about sam21.toulouse to slave sam25.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership
update "Expired" about sam21.toulouse to slave sam24.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership
update "Expired" about sam21.toulouse to slave sam22.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership
update "Expired" about sam21.toulouse to slave sam27.toulouse
Jul 1 09:07:37 sam22 lock_gulmd_core[4195]: Forked [4882] fence_node
sam21.toulouse with a 0 pause.
Jul 1 09:07:37 sam22 lock_gulmd_core[4882]: Gonna exec fence_node
sam21.toulouse
Jul 1 09:07:37 sam22 fence_node[4882]: Performing fence method, human, on
sam21.toulouse.
Jul 1 09:07:37 sam22 fence_manual: Node 192.168.0.121 requires hard reset.
Run "fence_ack_manual -s 192.168.0.121" after power cycling the machine.
Jul 1 09:07:38 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 1,
need 3 for quorum.
Jul 1 09:07:39 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 2,
need 3 for quorum.
Jul 1 09:07:39 sam22 lock_gulmd_core[4195]: New Client: idx:5 fd:10 from
(192.168.0.124:sam24.toulouse)
Jul 1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in
about sam24.toulouse to sam23.toulouse is lost because node is in OM
Jul 1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in
about sam24.toulouse to sam28.toulouse is lost because node is in OM
Jul 1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in
about sam24.toulouse to sam25.toulouse is lost because node is in OM
Jul 1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in
about sam24.toulouse to sam27.toulouse is lost because node is in OM
Jul 1 09:07:39 sam22 lock_gulmd_LT000[4196]: Attached slave
sam24.toulouse:192.168.0.124 idx:2 fd:7 (soff:3 connected:0x8)
Jul 1 09:07:39 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 2,
need 3 for quorum.
Jul 1 09:07:39 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 2,
need 3 for quorum.
Jul 1 09:07:41 sam22 lock_gulmd_core[4195]: Now have Slave quorum, going
full Master.
Jul 1 09:07:41 sam22 lock_gulmd_core[4195]: New Client: idx:6 fd:11 from
(192.168.0.123:sam23.toulouse)
Jul 1 09:07:41 sam22 lock_gulmd_core[4195]: Member update message Logged in
about sam23.toulouse to sam25.toulouse is lost because node is in OM
Jul 1 09:07:41 sam22 lock_gulmd_LTPX[4197]: Logged into LT000 at
sam22.toulouse:192.168.0.122
Jul 1 09:07:41 sam22 lock_gulmd_LT000[4196]: New Client: idx 3 fd 8 from
(192.168.0.122:sam22.toulouse)
Jul 1 09:07:41 sam22 lock_gulmd_LTPX[4197]: Finished resending to LT000
Jul 1 09:07:41 sam22 lock_gulmd_LT000[4196]: Attached slave
sam23.toulouse:192.168.0.123 idx:4 fd:9 (soff:2 connected:0xc)
Jul 1 09:07:41 sam22 lock_gulmd_LT000[4196]: New Client: idx 5 fd 10 from
(192.168.0.123:sam23.toulouse)
Jul 1 09:07:41 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Trying to
acquire journal lock...
Jul 1 09:07:41 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Looking at
journal...
Jul 1 09:07:41 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Acquiring the
transaction lock...
Jul 1 09:07:42 sam22 lock_gulmd_LT000[4196]: New Client: idx 6 fd 11 from
(192.168.0.124:sam24.toulouse)
Jul 1 09:07:45 sam22 lock_gulmd_core[4195]: New Client: idx:3 fd:7 from
(192.168.0.126:sam26.toulouse)
Jul 1 09:07:45 sam22 lock_gulmd_core[4195]: Member update message Logged in
about sam26.toulouse to sam25.toulouse is lost because node is in OM
Jul 1 09:07:45 sam22 lock_gulmd_LT000[4196]: New Client: idx 7 fd 12 from
(192.168.0.126:sam26.toulouse)
Jul 1 09:07:46 sam22 lock_gulmd_core[4195]: New Client: idx:7 fd:12 from
(192.168.0.128:sam28.toulouse)
Jul 1 09:07:46 sam22 lock_gulmd_core[4195]: Member update message Logged in
about sam28.toulouse to sam25.toulouse is lost because node is in OM
Jul 1 09:07:46 sam22 lock_gulmd_LT000[4196]: New Client: idx 8 fd 13 from
(192.168.0.128:sam28.toulouse)
Jul 1 09:07:47 sam22 lock_gulmd_core[4195]: New Client: idx:8 fd:13 from
(192.168.0.125:sam25.toulouse)
Jul 1 09:07:47 sam22 lock_gulmd_LT000[4196]: Attached slave
sam25.toulouse:192.168.0.125 idx:9 fd:14 (soff:1 connected:0xe)
Jul 1 09:07:47 sam22 lock_gulmd_LT000[4196]: New Client: idx 10 fd 15 from
(192.168.0.125:sam25.toulouse)
Jul 1 09:07:49 sam22 lock_gulmd_core[4195]: New Client: idx:9 fd:14 from
(192.168.0.127:sam27.toulouse)
Jul 1 09:07:49 sam22 lock_gulmd_LT000[4196]: New Client: idx 11 fd 16 from
(192.168.0.127:sam27.toulouse)
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Replaying
journal...
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Replayed 0 of
2 blocks
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: replays = 0,
skips = 1, sames = 1
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Journal
replayed in 9s
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Done
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Trying to
acquire journal lock...
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Busy
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Trying to
acquire journal lock...
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Busy
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Trying to
acquire journal lock...
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Busy
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Trying to
acquire journal lock...
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Busy
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Trying to
acquire journal lock...
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Busy
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Trying to
acquire journal lock...
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Busy
----->8-------->8-------->8-------->8---
Sincerely,
Alban Crequy