All, I'm trying to get a two-node Debian cluster up and running with GFS. The Linux machines themselves run on top of VMware on blade systems. I can successfully start all required subsystems (ccsd, cman, fenced) and am able to mount the GFS filesystem on both nodes. After some time, however (usually in the middle of the night; the cause is unknown, but it might have something to do with clock drift on the VMs), one of the nodes decides that the heartbeat is late and kicks the other node out of the cluster.

On the 'reporting' node:

Mar 17 03:58:34 srv129bh heartbeat[2722]: WARN: Late heartbeat: Node srv128bh: interval 15150 ms
Mar 17 03:58:40 srv129bh kernel: CMAN: removing node srv128bh from the cluster : Missed too many heartbeats

On the 'failing' node:

Mar 17 03:58:24 srv128bh kernel: CMAN: Being told to leave the cluster by node 2
Mar 17 03:58:24 srv128bh kernel: CMAN: we are leaving the cluster.
Mar 17 03:58:24 srv128bh kernel: WARNING: dlm_emergency_shutdown
Mar 17 03:58:24 srv128bh kernel: WARNING: dlm_emergency_shutdown

Note the roughly 15-second delay, which may indicate clock drift between the nodes.

Some hours later I get this error on the 'failing' node:

Mar 17 06:26:25 srv128bh kernel: dlm: dlm_lock: no lockspace
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuild resource directory
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuilt 16 resources
Mar 17 06:26:25 srv128bh kernel: gfs_1 purge requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 purged 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 mark waiting requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 marked 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 purge locks of departed nodes
Mar 17 06:26:25 srv128bh kernel: gfs_1 purged 14 locks
Mar 17 06:26:25 srv128bh kernel: gfs_1 update remastered resources
Mar 17 06:26:25 srv128bh kernel: gfs_1 updated 1 resources
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuild locks
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuilt 0 locks
Mar 17 06:26:25 srv128bh kernel: gfs_1 recover event 8 done
Mar 17 06:26:25 srv128bh kernel: gfs_1 move flags 0,0,1 ids 6,8,8
Mar 17 06:26:25 srv128bh kernel: gfs_1 process held requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 processed 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 resend marked requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 resent 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 recover event 8 finished
Mar 17 06:26:25 srv128bh kernel: gfs_1 move flags 1,0,0 ids 8,8,8
Mar 17 06:26:25 srv128bh kernel: gfs_1 move flags 0,1,0 ids 8,10,8
Mar 17 06:26:25 srv128bh kernel: gfs_1 move use event 10
Mar 17 06:26:25 srv128bh kernel: gfs_1 recover event 10
Mar 17 06:26:25 srv128bh kernel: gfs_1 add node 2
Mar 17 06:26:25 srv128bh kernel: gfs_1 total nodes 2
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuild resource directory
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuilt 9 resources
Mar 17 06:26:25 srv128bh kernel: gfs_1 purge requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 purged 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 mark waiting requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 marked 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 recover event 10 done
Mar 17 06:26:25 srv128bh kernel: gfs_1 move flags 0,0,1 ids 8,10,10
Mar 17 06:26:25 srv128bh kernel: gfs_1 process held requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 processed 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 resend marked requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 resent 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 recover event 10 finished
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start last_stop 0 last_start 4 last_finish 0
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start count 1 type 2 event 4 flags 250
Mar 17 06:26:25 srv128bh kernel: 3259 claim_jid 0
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start 4 done 1
Mar 17 06:26:25 srv128bh kernel: 3259 pr_finish flags 5b
Mar 17 06:26:25 srv128bh kernel: 3250 recovery_done jid 0 msg 309 b
Mar 17 06:26:25 srv128bh kernel: 3250 recovery_done nodeid 1 flg 18
Mar 17 06:26:25 srv128bh kernel: 3250 recovery_done jid 1 msg 309 b
Mar 17 06:26:25 srv128bh kernel: 3250 others_may_mount b
Mar 17 06:26:25 srv128bh kernel: 3258 pr_start last_stop 4 last_start 7 last_finish 4
Mar 17 06:26:25 srv128bh kernel: 3258 pr_start count 2 type 2 event 7 flags a1b
Mar 17 06:26:25 srv128bh kernel: 3258 pr_start 7 done 1
Mar 17 06:26:25 srv128bh kernel: 3258 pr_finish flags 81b
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start last_stop 7 last_start 8 last_finish 7
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start count 1 type 1 event 8 flags a1b
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start cb jid 1 id 2
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start 8 done 0
Mar 17 06:26:25 srv128bh kernel: 3262 recovery_done jid 1 msg 309 91b
Mar 17 06:26:25 srv128bh kernel: 3262 recovery_done nodeid 2 flg 1b
Mar 17 06:26:25 srv128bh kernel: 3262 recovery_done start_done 8
Mar 17 06:26:25 srv128bh kernel: 3258 pr_finish flags 81b
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start last_stop 8 last_start 11 last_finish 8
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start count 2 type 2 event 11 flags a1b
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start 11 done 1
Mar 17 06:26:25 srv128bh kernel: 3259 pr_finish flags 81b
Mar 17 06:26:25 srv128bh kernel:
Mar 17 06:26:25 srv128bh kernel: lock_dlm: Assertion failed on line 428 of file /home/jnhon/redhat-cluster/linux-modules-extra-2.6-2.6.18/debian/build/build_i386_none_k7_redhat-cluster/gfs/dlm/lock.c
Mar 17 06:26:25 srv128bh kernel: lock_dlm: assertion: "!error"
Mar 17 06:26:25 srv128bh kernel: lock_dlm: time = 12311916
Mar 17 06:26:25 srv128bh kernel: gfs_1: num=2,17 err=-22 cur=-1 req=3 lkf=10000
Mar 17 06:26:25 srv128bh kernel:
Mar 17 06:26:25 srv128bh kernel: Modules linked in: ip_vs_wlc ip_vs button ac battery vmmemctl ipv6 lock_nolock lock_dlm dlm gfs lock_harness cman loop tsdev parport_pc parport intel_agp floppy i2c_piix4 psmouse agpgart shpchp pci_hotplug i2c_core serio_raw evdev rtc pcspkr ext3 jbd mbcache dm_mirror dm_snapshot dm_mod sd_mod ide_cd cdrom pcnet32 mii mptspi mptscsih mptbase scsi_transport_spi scsi_mod piix generic ide_core thermal processor fan
Mar 17 06:26:25 srv128bh kernel: EIP: 0060:[<f8c2f4c2>] Tainted: P VLI
Mar 17 06:26:25 srv128bh kernel: EFLAGS: 00010296 (2.6.18-4-k7 #1)

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: ------------[ cut here ]------------

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: kernel BUG at /home/jnhon/redhat-cluster/linux-modules-extra-2.6-2.6.18/debian/build/build_i386_none_k7_redhat-cluster/gfs/dlm/lock.c:428!

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: esi: dfdd5080 edi: f28f01c0 ebp: 00000001 esp: f4feddd8

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: ds: 007b es: 007b ss: 0068

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: Process find (pid: 5178, ti=f4fec000 task=c214caa0 task.ti=f4fec000)

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: Stack: f8c33955 00000000 202001c0 20202020 20203220 20202020 20202020 20202020

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: 00183731 f28f01c0 00000000 00000003 f28f01c0 f8c2f61b 00000003 f8c36c60

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: invalid opcode: 0000 [#1]

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: SMP

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: CPU: 0

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: EIP is at do_dlm_lock+0x133/0x14d [lock_dlm]

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: f8cf0000 00000000 f8cc5662 00000008 f5069468 f8cf0000 00000008 00000003

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: Call Trace:

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: Code: 2c 50 0f bf 47 24 50 53 ff 77 08 ff 77 04 ff 77 0c ff 76 18 68 c2 39 c3 f8 e8 b8 e5 4e c7 83 c4 38 68 55 39 c3 f8 e8 ab e5 4e c7 <0f> 0b ac 01 4d 38 c3 f8 68 57 39 c3 f8 e8 d4 db 4e c7 83 c4 20

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: EIP: [<f8c2f4c2>] do_dlm_lock+0x133/0x14d [lock_dlm] SS:ESP 0068:f4feddd8

Message from syslogd@srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: eax: 00000004 ebx: ffffffea ecx: ffffffff edx: fffff696

Mar 17 06:26:25 srv128bh kernel: [<f8c2f61b>] lm_dlm_lock+0xb6/0xc0 [lock_dlm]
Mar 17 06:26:25 srv128bh kernel: [<f8cc5662>] gfs_lm_lock+0x35/0x4d [gfs]
Mar 17 06:26:25 srv128bh kernel: [<f8cbe220>] gfs_glock_xmote_th+0x113/0x151 [gfs]
Mar 17 06:26:25 srv128bh kernel: [<f8cbccd6>] run_queue+0x221/0x2e4 [gfs]
Mar 17 06:26:25 srv128bh kernel: [<f8cbd329>] gfs_glock_nq+0x353/0x3a4 [gfs]
Mar 17 06:26:25 srv128bh kernel: [<f8cbd38d>] gfs_glock_nq_init+0x13/0x26 [gfs]
Mar 17 06:26:25 srv128bh kernel: [<f8ccfb17>] gfs_getattr+0x37/0x58 [gfs]
Mar 17 06:26:25 srv128bh kernel: [<f8ccfae0>] gfs_getattr+0x0/0x58 [gfs]
Mar 17 06:26:25 srv128bh kernel: [<c016240b>] vfs_getattr+0x40/0x99
Mar 17 06:26:25 srv128bh kernel: [<c016248b>] vfs_lstat_fd+0x27/0x39
Mar 17 06:26:25 srv128bh kernel: [<c01592df>] do_filp_open+0x2b/0x31
Mar 17 06:26:25 srv128bh kernel: [<c01624e2>] sys_lstat64+0xf/0x23
Mar 17 06:26:25 srv128bh kernel: [<c016e2cb>] dput+0x1a/0x119
Mar 17 06:26:25 srv128bh kernel: [<c015b4fa>] __fput+0x11c/0x13f
Mar 17 06:26:25 srv128bh kernel: [<c01717c2>] mntput_no_expire+0x11/0x68
Mar 17 06:26:25 srv128bh kernel: [<c0158fa1>] filp_close+0x4e/0x54

As soon as I now touch the filesystem on the 'reporting' node, it simply freezes: the node is still pingable, but any filesystem access seems to hang. Another thing I'm seeing is that fenced never gets invoked. For now I have set up manual fencing, and I have tried other fence agents as well, but the failing node never seems to get properly fenced.

My questions are:

1. Can clock drift between the nodes cause the 'late heartbeat'? (The NTP setup I plan to try is sketched below.)
2. Under what circumstances does a node not get fenced? (The commands I use to look at the fencing state are sketched below as well.)
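Regarding question 1: to take clock drift out of the equation I intend to run ntpd on both guests and stop VMware Tools from also adjusting the clock. Below is a rough sketch of what I have in mind; the 'tinker panic 0' line and the tools.syncTime setting are what I remember from VMware's timekeeping recommendations, and ntp1/ntp2.example.com are just placeholders for our own time servers, so please correct me if this is off.

/etc/ntp.conf on both nodes:

  # don't let ntpd bail out when a VM wakes up with a huge offset
  tinker panic 0
  driftfile /var/lib/ntp/ntp.drift
  # placeholders for our internal time sources
  server ntp1.example.com iburst
  server ntp2.example.com iburst

and in each guest's .vmx file, so that only ntpd disciplines the clock:

  tools.syncTime = "FALSE"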
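Regarding question 2: this is roughly how I look at the cluster state after a failure and how I acknowledge the manual fence once I have reset the failed node by hand. The fence_ack_manual syntax is from memory, so treat it as a sketch rather than the authoritative incantation:

  # on the surviving node: what do cman and fenced think is going on?
  cat /proc/cluster/nodes
  cat /proc/cluster/services
  cman_tool status

  # after resetting srv128bh by hand, tell fenced that recovery may proceed
  fence_ack_manual -n srv128bh

If there is a better way to verify whether fenced ever received a fence request for the failed node, I'd be happy to hear it.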
For completeness, my cluster.conf:

<?xml version="1.0"?>
<cluster name="gfs_ontwikkel" config_version="1">
  <cman two_node="1" expected_votes="1"></cman>
  <fencedevices>
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="srv128bh" votes="1" nodeid="srv128bh">
      <fence>
        <method name="single">
          <device name="manual"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="srv129bh" votes="1" nodeid="srv129bh">
      <fence>
        <method name="single">
          <device name="manual"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

Thanks in advance,
Jeroen
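P.S. One thing in the cluster.conf above that I'm not sure about is the nodeid attribute: most examples I have seen use small integers rather than the hostname. If that matters, this is what I would change it to (an untested sketch, with the rest of each <clusternode> block unchanged):

  <clusternode name="srv128bh" votes="1" nodeid="1">
  <clusternode name="srv129bh" votes="1" nodeid="2">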