Fencing the gulm master node: problem

Hello,

I want to test the fencing capability of GFS by unplugging the network cable on a node, but I run into a problem when the node I unplug is the gulm master.

I am using these RPMs:
 - GFS-6.0.2.20-2
 - GFS-modules-smp-6.0.2.20-2

I have an 8-node cluster (sam21, sam22, ..., sam28). I mount a GFS filesystem on /mnt/gfs on all nodes.

My config is:

----->8-------->8-------->8-------->8---
# fence.ccs
fence_devices {
        admin {
                agent="fence_manual"
        }
}

# cluster.ccs
cluster {
        name="sam"
        lock_gulm {
                servers=["sam21", "sam22", "sam23", "sam24", "sam25"]
        }
}

# nodes.ccs
nodes {
        sam21.toulouse {
                ip_interfaces {
                        eth0 = "192.168.0.121"
                }
                fence {
                        human {
                                admin {
                                        ipaddr = "192.168.0.121"
                                }
                        }
                }
        }
# etc. for sam22 ... sam28
}
----->8-------->8-------->8-------->8---

I want to check that the unplugged node is fenced and that its locks are released when I run "fence_ack_manual" (and only when I run "fence_ack_manual", not before).

In order to know when the locks are released, I wrote a small program:

----->8-------->8-------->8-------->8---
// lock.c - acquire an exclusive flock() on a file in the GFS mount, then
// keep writing to it once per second so I can see which node holds the lock.
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>

int main(void)
{
  char buf[64];
  int fd, error, p, counter = 0;
  pid_t pid = getpid();

  fd = open("/mnt/gfs/lock-test.tmp", O_RDWR|O_CREAT, S_IREAD|S_IWRITE);
  if (fd == -1) {
    printf("ERROR: open failed.\n");
    return 1;
  }
  error = flock(fd, LOCK_EX);  // blocks until the exclusive lock is granted
  if (error == -1) {
    printf("ERROR: lock failed.\n");
    return 1;
  }
  while (1) {
    printf("writing... pid %d : %d\n", pid, counter++);
    buf[0] = 0;
    p = sprintf(buf, "pid %d : %d\n", pid, counter);
    write(fd, buf, p);
    sleep(1);
  }
}
----->8-------->8-------->8-------->8---
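
As a side note, the exact moment of release could also be observed without competing for the lock, using a non-blocking probe along these lines (this is only a sketch for illustration, not something I ran in the tests below; the file name and 1-second interval are arbitrary):

----->8-------->8-------->8-------->8---
// lockwatch.c (sketch) - probe the flock() once a second with LOCK_NB so the
// instant the exclusive lock becomes free is logged without holding it.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>

int main(void)
{
  int fd = open("/mnt/gfs/lock-test.tmp", O_RDWR);
  if (fd == -1) {
    printf("ERROR: open failed.\n");
    return 1;
  }
  while (1) {
    // LOCK_NB makes flock() fail immediately if the lock is held elsewhere
    if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
      printf("lock is free at %ld\n", (long) time(NULL));
      flock(fd, LOCK_UN);  // drop it right away, we only wanted to probe
    } else {
      printf("lock still held at %ld\n", (long) time(NULL));
    }
    sleep(1);
  }
}
----->8-------->8-------->8-------->8---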


First test (which works):
- I run my lock.c program on sam26 (not a gulm server)
- The lock is acquired on sam26
- I run my lock.c program on all the other nodes
- The other nodes wait for the lock
- I unplug sam26 and wait until the gulm master (sam21) wants to fence sam26
- The gulm master (sam21) asks me to run fence_ack_manual
- The lock is not taken by any other node
- I run fence_ack_manual
- The lock is released on the unplugged node (sam26) and taken by another node
=> So when I unplug a node which is not a gulm server, everything works correctly.


Second test (which doesn't work):
- I run my lock.c program on sam21 (the gulm master)
- The lock is acquired on sam21
- I run my lock.c program on all the other nodes
- The other nodes wait for the lock
- I unplug sam21 and wait until a new gulm master (sam22) wants to fence the old master (sam21)
- The new gulm master (sam22) asks me to run fence_ack_manual, BUT the lock is released immediately. I have not run fence_ack_manual and the lock is already released. This is my problem.

I read the bug reports [1][2] and the advisory RHBA-2005:466-11 [3], which says "Fixed a problem in which a gulm lock server ran on GFS clients after the master server died." But I am already using GFS-6.0.2.20-2, so that fix should be included.

[1] https://bugzilla.redhat.com/beta/show_bug.cgi?id=148029
[2] https://bugzilla.redhat.com/beta/show_bug.cgi?id=149119
[3] http://rhn.redhat.com/errata/RHBA-2005-466.html

Is this a bug, or am I misunderstanding the fencing mechanism?

The syslog on the new gulm master (sam22) shows:

----->8-------->8-------->8-------->8---
Jul  1 09:06:52 sam22 lock_gulmd_core[4195]: Failed to receive a timely heartbeat reply from Master. (t:1120201612489192 mb:1)
Jul  1 09:07:07 sam22 lock_gulmd_core[4195]: Failed to receive a timely heartbeat reply from Master. (t:1120201627509192 mb:2)
Jul  1 09:07:22 sam22 lock_gulmd_core[4195]: Failed to receive a timely heartbeat reply from Master. (t:1120201642529191 mb:3)
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: I see no Masters, So I am Arbitrating until enough Slaves talk to me.
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to slave sam23.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to slave sam28.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to slave sam26.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to slave sam25.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to slave sam24.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to slave sam22.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to slave sam27.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: LastMaster sam21.toulouse:192.168.0.121, is being marked Expired.
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership update "Expired" about sam21.toulouse to slave sam23.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership update "Expired" about sam21.toulouse to slave sam28.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership update "Expired" about sam21.toulouse to slave sam26.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_LTPX[4197]: New Master at sam22.toulouse:192.168.0.122
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership update "Expired" about sam21.toulouse to slave sam25.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership update "Expired" about sam21.toulouse to slave sam24.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership update "Expired" about sam21.toulouse to slave sam22.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership update "Expired" about sam21.toulouse to slave sam27.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Forked [4882] fence_node sam21.toulouse with a 0 pause.
Jul  1 09:07:37 sam22 lock_gulmd_core[4882]: Gonna exec fence_node sam21.toulouse
Jul  1 09:07:37 sam22 fence_node[4882]: Performing fence method, human, on sam21.toulouse.
Jul  1 09:07:37 sam22 fence_manual: Node 192.168.0.121 requires hard reset. Run "fence_ack_manual -s 192.168.0.121" after power cycling the machine.
Jul  1 09:07:38 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 1, need 3 for quorum.
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 2, need 3 for quorum.
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: New Client: idx:5 fd:10 from (192.168.0.124:sam24.toulouse)
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in about sam24.toulouse to sam23.toulouse is lost because node is in OM
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in about sam24.toulouse to sam28.toulouse is lost because node is in OM
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in about sam24.toulouse to sam25.toulouse is lost because node is in OM
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in about sam24.toulouse to sam27.toulouse is lost because node is in OM
Jul  1 09:07:39 sam22 lock_gulmd_LT000[4196]: Attached slave sam24.toulouse:192.168.0.124 idx:2 fd:7 (soff:3 connected:0x8)
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 2, need 3 for quorum.
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 2, need 3 for quorum.
Jul  1 09:07:41 sam22 lock_gulmd_core[4195]: Now have Slave quorum, going full Master.
Jul  1 09:07:41 sam22 lock_gulmd_core[4195]: New Client: idx:6 fd:11 from (192.168.0.123:sam23.toulouse)
Jul  1 09:07:41 sam22 lock_gulmd_core[4195]: Member update message Logged in about sam23.toulouse to sam25.toulouse is lost because node is in OM
Jul  1 09:07:41 sam22 lock_gulmd_LTPX[4197]: Logged into LT000 at sam22.toulouse:192.168.0.122
Jul  1 09:07:41 sam22 lock_gulmd_LT000[4196]: New Client: idx 3 fd 8 from (192.168.0.122:sam22.toulouse)
Jul  1 09:07:41 sam22 lock_gulmd_LTPX[4197]: Finished resending to LT000
Jul  1 09:07:41 sam22 lock_gulmd_LT000[4196]: Attached slave sam23.toulouse:192.168.0.123 idx:4 fd:9 (soff:2 connected:0xc)
Jul  1 09:07:41 sam22 lock_gulmd_LT000[4196]: New Client: idx 5 fd 10 from (192.168.0.123:sam23.toulouse)
Jul  1 09:07:41 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Trying to acquire journal lock...
Jul  1 09:07:41 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Looking at journal...
Jul  1 09:07:41 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Acquiring the transaction lock...
Jul  1 09:07:42 sam22 lock_gulmd_LT000[4196]: New Client: idx 6 fd 11 from (192.168.0.124:sam24.toulouse)
Jul  1 09:07:45 sam22 lock_gulmd_core[4195]: New Client: idx:3 fd:7 from (192.168.0.126:sam26.toulouse)
Jul  1 09:07:45 sam22 lock_gulmd_core[4195]: Member update message Logged in about sam26.toulouse to sam25.toulouse is lost because node is in OM
Jul  1 09:07:45 sam22 lock_gulmd_LT000[4196]: New Client: idx 7 fd 12 from (192.168.0.126:sam26.toulouse)
Jul  1 09:07:46 sam22 lock_gulmd_core[4195]: New Client: idx:7 fd:12 from (192.168.0.128:sam28.toulouse)
Jul  1 09:07:46 sam22 lock_gulmd_core[4195]: Member update message Logged in about sam28.toulouse to sam25.toulouse is lost because node is in OM
Jul  1 09:07:46 sam22 lock_gulmd_LT000[4196]: New Client: idx 8 fd 13 from (192.168.0.128:sam28.toulouse)
Jul  1 09:07:47 sam22 lock_gulmd_core[4195]: New Client: idx:8 fd:13 from (192.168.0.125:sam25.toulouse)
Jul  1 09:07:47 sam22 lock_gulmd_LT000[4196]: Attached slave sam25.toulouse:192.168.0.125 idx:9 fd:14 (soff:1 connected:0xe)
Jul  1 09:07:47 sam22 lock_gulmd_LT000[4196]: New Client: idx 10 fd 15 from (192.168.0.125:sam25.toulouse)
Jul  1 09:07:49 sam22 lock_gulmd_core[4195]: New Client: idx:9 fd:14 from (192.168.0.127:sam27.toulouse)
Jul  1 09:07:49 sam22 lock_gulmd_LT000[4196]: New Client: idx 11 fd 16 from (192.168.0.127:sam27.toulouse)
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Replaying journal...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Replayed 0 of 2 blocks
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: replays = 0, skips = 1, sames = 1
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Journal replayed in 9s
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Done
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Trying to acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Busy
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Trying to acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Busy
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Trying to acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Busy
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Trying to acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Busy
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Trying to acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Busy
Jul 1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Trying to acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Busy
----->8-------->8-------->8-------->8---

Sincerely,

Alban Crequy

