OK, I was just logging into the two nodes of my cluster, tf1 and tf2, and noticed that tf1 was NOT available via ssh, but tf2 was. tf1 was pingable, but that was it. I looked on tf2 and noticed that it had taken over the cluster virtual IP address:

2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:11:43:d7:c9:c6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.6/24 brd 192.168.1.255 scope global eth0
    inet 192.168.1.7/32 scope global eth0
    inet6 fe80::211:43ff:fed7:c9c6/64 scope link
       valid_lft forever preferred_lft forever

In the syslog on tf2, I saw:

Oct 25 20:26:00 tf2 kernel: CMAN: removing node tf1 from the cluster : Missed too many heartbeats
Oct 25 20:26:00 tf2 fenced[4091]: tf1 not a cluster member after 0 sec post_fail_delay
Oct 25 20:26:00 tf2 fenced[4091]: fencing node "tf1"
Oct 25 20:26:04 tf2 kernel: e100: eth2: e100_watchdog: link down
Oct 25 20:26:08 tf2 fenced[4091]: fence "tf1" success
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Trying to acquire journal lock...
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Looking at journal...
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Acquiring the transaction lock...
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Replaying journal...
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Replayed 0 of 11 blocks
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: replays = 0, skips = 0, sames = 11
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Journal replayed in 1s
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Done
Oct 25 20:26:27 tf2 clurgmgrd[4903]: <info> Magma Event: Membership Change
Oct 25 20:26:27 tf2 clurgmgrd[4903]: <info> State change: tf1 DOWN
Oct 25 20:26:27 tf2 clurgmgrd[4903]: <notice> Starting stopped service Apache Service
Oct 25 20:26:29 tf2 httpd: httpd startup succeeded
Oct 25 20:26:29 tf2 clurgmgrd[4903]: <notice> Service Apache Service started
Oct 25 20:26:36 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
Oct 25 20:28:08 tf2 kernel: e100: eth2: e100_watchdog: link down
Oct 25 20:28:10 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
Oct 25 20:29:40 tf2 kernel: CMAN: node tf1 rejoining

After a few more minutes, I noticed that tf1 *appeared* to be rebooting, and I saw this in the syslog on tf2:

Oct 25 20:34:25 tf2 kernel: CMAN: too many transition restarts - will die
Oct 25 20:34:25 tf2 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
Oct 25 20:34:25 tf2 kernel: WARNING: dlm_emergency_shutdown
Oct 25 20:34:25 tf2 clurgmgrd[4903]: <warning> #67: Shutting down uncleanly
Oct 25 20:34:25 tf2 kernel: WARNING: dlm_emergency_shutdown
Oct 25 20:34:25 tf2 kernel: SM: 00000001 sm_stop: SG still joined
Oct 25 20:34:25 tf2 kernel: SM: 01000002 sm_stop: SG still joined
Oct 25 20:34:25 tf2 kernel: SM: 02000004 sm_stop: SG still joined
Oct 25 20:34:25 tf2 kernel: SM: 03000005 sm_stop: SG still joined
Oct 25 20:34:25 tf2 ccsd[3988]: Cluster manager shutdown. Attemping to reconnect...
Oct 25 20:34:26 tf2 httpd: httpd shutdown succeeded
Oct 25 20:34:26 tf2 kernel: parted nodes
Oct 25 20:34:26 tf2 kernel: clvmd rebuilt 0 resources
Oct 25 20:34:26 tf2 kernel: clvmd purge requests
Oct 25 20:34:26 tf2 kernel: clvmd purged 0 requests
Oct 25 20:34:26 tf2 kernel: clvmd mark waiting requests
Oct 25 20:34:26 tf2 kernel: clvmd marked 0 requests
Oct 25 20:34:26 tf2 kernel: clvmd purge locks of departed nodes
Oct 25 20:34:26 tf2 kernel: lv1 purged 1 locks
Oct 25 20:34:26 tf2 kernel: lv1 update remastered resources
Oct 25 20:34:26 tf2 kernel: clvmd purged 0 locks
Oct 25 20:34:26 tf2 kernel: clvmd update remastered resources
Oct 25 20:34:26 tf2 kernel: clvmd updated 1 resources
Oct 25 20:34:26 tf2 kernel: clvmd rebuild locks
Oct 25 20:34:26 tf2 kernel: clvmd rebuilt 0 locks
Oct 25 20:34:26 tf2 kernel: clvmd recover event 7 done
Oct 25 20:34:26 tf2 kernel: Magma move flags 0,0,1 ids 6,7,7
Oct 25 20:34:26 tf2 kernel: Magma process held requests
Oct 25 20:34:26 tf2 kernel: Magma processed 0 requests
Oct 25 20:34:26 tf2 kernel: Magma resend marked requests
Oct 25 20:34:26 tf2 kernel: Magma resend 6403d9 lq 1 flg 200000 node -1/-1 "usrm::vf"
Oct 25 20:34:26 tf2 kernel: Magma resent 1 requests
Oct 25 20:34:26 tf2 kernel: Magma recover event 7 finished
Oct 25 20:34:26 tf2 kernel: clvmd move flags 0,0,1 ids 2,7,7
Oct 25 20:34:26 tf2 kernel: clvmd process held requests
Oct 25 20:34:26 tf2 kernel: clvmd processed 0 requests
Oct 25 20:34:26 tf2 kernel: clvmd resend marked requests
Oct 25 20:34:26 tf2 kernel: clvmd resent 0 requests
Oct 25 20:34:26 tf2 kernel: clvmd recover event 7 finished
Oct 25 20:34:26 tf2 kernel: lv1 updated 525 resources
Oct 25 20:34:26 tf2 kernel: lv1 rebuild locks
Oct 25 20:34:26 tf2 kernel: lv1 rebuilt 0 locks
Oct 25 20:34:26 tf2 kernel: lv1 recover event 7 done
Oct 25 20:34:26 tf2 kernel: lv1 move flags 0,0,1 ids 3,7,7
Oct 25 20:34:26 tf2 kernel: lv1 process held requests
Oct 25 20:34:26 tf2 kernel: lv1 processed 0 requests
Oct 25 20:34:26 tf2 kernel: lv1 resend marked requests
Oct 25 20:34:26 tf2 kernel: lv1 resent 0 requests
Oct 25 20:34:26 tf2 kernel: lv1 recover event 7 finished
Oct 25 20:34:26 tf2 kernel: 4189 pr_start last_stop 0 last_start 4 last_finish 0
Oct 25 20:34:26 tf2 kernel: 4189 pr_start count 2 type 2 event 4 flags 250
Oct 25 20:34:26 tf2 kernel: 4189 claim_jid 1
Oct 25 20:34:26 tf2 kernel: 4189 pr_start 4 done 1
Oct 25 20:34:26 tf2 kernel: 4189 pr_finish flags 5a
Oct 25 20:34:26 tf2 kernel: 4168 recovery_done jid 1 msg 309 a
Oct 25 20:34:26 tf2 kernel: 4168 recovery_done nodeid 2 flg 18
Oct 25 20:34:26 tf2 kernel: 4189 pr_start last_stop 4 last_start 7 last_finish 4
Oct 25 20:34:26 tf2 kernel: 4189 pr_start count 1 type 1 event 7 flags 21a
Oct 25 20:34:26 tf2 kernel: 4189 pr_start cb jid 0 id 1
Oct 25 20:34:26 tf2 kernel: 4189 pr_start 7 done 0
Oct 25 20:34:26 tf2 kernel: 4192 recovery_done jid 0 msg 309 11a
Oct 25 20:34:26 tf2 kernel: 4192 recovery_done nodeid 1 flg 1b
Oct 25 20:34:26 tf2 kernel: 4192 recovery_done start_done 7
Oct 25 20:34:26 tf2 kernel: 4189 pr_finish flags 1a
Oct 25 20:34:26 tf2 kernel:
Oct 25 20:34:26 tf2 kernel: lock_dlm: Assertion failed on line 428 of file /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
Oct 25 20:34:26 tf2 kernel: lock_dlm: assertion: "!error"
Oct 25 20:34:26 tf2 kernel: lock_dlm: time = 623964971
Oct 25 20:34:26 tf2 kernel: lv1: num=2,1a err=-22 cur=-1 req=3 lkf=10000
Oct 25 20:34:26 tf2 kernel:
Oct 25 20:34:26 tf2 kernel: ------------[ cut here ]------------
Oct 25 20:34:26 tf2 kernel: kernel BUG at /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c:428!
Oct 25 20:34:26 tf2 kernel: invalid operand: 0000 [#1]
Oct 25 20:34:26 tf2 kernel: SMP
Oct 25 20:34:26 tf2 kernel: Modules linked in: dcdipm(U) dcdbas(U) parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd ehci_hcd hw_random shpchp eepro100 e100 mii e1000 floppy sg ext3 jbd dm_mod aic7xxx megaraid_mbox megaraid_mm sd_mod scsi_mod
Oct 25 20:34:26 tf2 kernel: CPU: 2
Oct 25 20:34:26 tf2 kernel: EIP: 0060:[<f8acc779>] Tainted: P VLI
Oct 25 20:34:26 tf2 kernel: EFLAGS: 00010246 (2.6.9-34.ELsmp)
Oct 25 20:34:26 tf2 kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]
Oct 25 20:34:26 tf2 kernel: eax: 00000001 ebx: ffffffea ecx: f1be9d50 edx: f8ad115f
Oct 25 20:34:26 tf2 kernel: esi: f8acc798 edi: f7e7da00 ebp: c2355b00 esp: f1be9d4c
Oct 25 20:34:26 tf2 kernel: ds: 007b es: 007b ss: 0068
Oct 25 20:34:26 tf2 kernel: Process umount (pid: 13456, threadinfo=f1be9000 task=f66c7230)
Oct 25 20:34:26 tf2 kernel: Stack: f8ad115f 20202020 32202020 20202020 20202020 20202020 61312020 f1f40018
Oct 25 20:34:26 tf2 kernel: f1f422b8 c2355b00 00000003 00000000 c2355b00 f8acc828 00000003 f8ad4860
Oct 25 20:34:26 tf2 kernel: f8b20000 f8bf45b2 00000008 00000001 f4fbc5c4 f4fbc5a8 f8b20000 f8bea5cd
Oct 25 20:34:26 tf2 kernel: Call Trace:
Oct 25 20:34:26 tf2 kernel: [<f8acc828>] lm_dlm_lock+0x49/0x52 [lock_dlm]
Oct 25 20:34:26 tf2 kernel: [<f8bf45b2>] gfs_lm_lock+0x35/0x4d [gfs]
Oct 25 20:34:26 tf2 kernel: [<f8bea5cd>] gfs_glock_xmote_th+0x130/0x172 [gfs]
Oct 25 20:34:26 tf2 kernel: [<f8be9c91>] rq_promote+0xc8/0x147 [gfs]
Oct 25 20:34:26 tf2 kernel: [<f8be9e7d>] run_queue+0x91/0xc1 [gfs]
Oct 25 20:34:26 tf2 kernel: [<f8beae88>] gfs_glock_nq+0xcf/0x116 [gfs]
Oct 25 20:34:26 tf2 kernel: [<f8beb40f>] gfs_glock_nq_init+0x13/0x26 [gfs]
Oct 25 20:34:26 tf2 kernel: [<f8c02e64>] gfs_permission+0x0/0x61 [gfs]
Oct 25 20:34:26 tf2 kernel: [<f8c02e9e>] gfs_permission+0x3a/0x61 [gfs]
Oct 25 20:34:26 tf2 kernel: [<f8c02e64>] gfs_permission+0x0/0x61 [gfs]
Oct 25 20:34:26 tf2 kernel: [<c0165870>] permission+0x2b/0x4f
Oct 25 20:34:26 tf2 kernel: [<c0165dbf>] __link_path_walk+0x148/0xbb5
Oct 25 20:34:26 tf2 kernel: [<c016686f>] link_path_walk+0x43/0xbe
Oct 25 20:34:26 tf2 kernel: [<c0150309>] do_brk+0x1f2/0x22c
Oct 25 20:34:26 tf2 kernel: [<c0166c04>] path_lookup+0x14b/0x17f
Oct 25 20:34:26 tf2 kernel: [<c0166d4c>] __user_walk+0x21/0x51
Oct 25 20:34:26 tf2 kernel: [<c0162460>] sys_readlink+0x20/0x82
Oct 25 20:34:26 tf2 kernel: [<c0150309>] do_brk+0x1f2/0x22c
Oct 25 20:34:26 tf2 kernel: [<c011ad21>] do_page_fault+0x0/0x5c6
Oct 25 20:34:26 tf2 kernel: [<c02d2657>] syscall_call+0x7/0xb
Oct 25 20:34:26 tf2 kernel: Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 8a 12 ad f8 e8 ce 5e 65 c7 83 c4 38 68 5f 11 ad f8 e8 c1 5e 65 c7 <0f> 0b ac 01 a7 10 ad f8 68 61 11 ad f8 e8 7c 56 65 c7 83 c4 20
Oct 25 20:34:26 tf2 kernel: <0>Fatal exception: panic in 5 seconds

And now tf2 is unreachable too. Ideas? Suggestions?

Jason