On Wed, 2006-10-25 at 20:56 -0400, jason@xxxxxxxxxxxxxx wrote:
> ok, I was just logging into the 2 nodes of my cluster, tf1 and tf2, I noticed that tf1 was NOT
> available via ssh, but tf2 was. tf1 was pingable, but that was it. I looked on tft2 and
> noticed that he had taken over the cluster virtual ip address
>
> 2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
>     link/ether 00:11:43:d7:c9:c6 brd ff:ff:ff:ff:ff:ff
>     inet 192.168.1.6/24 brd 192.168.1.255 scope global eth0
>     inet 192.168.1.7/32 scope global eth0
>     inet6 fe80::211:43ff:fed7:c9c6/64 scope link
>        valid_lft forever preferred_lft forever

Well, I can walk through what happened here.

> Oct 25 20:26:00 tf2 kernel: CMAN: removing node tf1 from the cluster : Missed too many
> heartbeats

Node died for some reason.

> Oct 25 20:26:00 tf2 fenced[4091]: tf1 not a cluster member after 0 sec post_fail_delay
> Oct 25 20:26:00 tf2 fenced[4091]: fencing node "tf1"
> Oct 25 20:26:04 tf2 kernel: e100: eth2: e100_watchdog: link down
> Oct 25 20:26:08 tf2 fenced[4091]: fence "tf1" success

^^ Fence recovery.

> Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Trying to acquire journal
> lock...
...
> Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Done

^^ GFS recovery

> Oct 25 20:26:27 tf2 clurgmgrd[4903]: <info> Magma Event: Membership Change
> Oct 25 20:26:27 tf2 clurgmgrd[4903]: <info> State change: tf1 DOWN

^^ Rgmanager recovery

> Oct 25 20:26:27 tf2 clurgmgrd[4903]: <notice> Starting stopped service Apache Service
> Oct 25 20:26:29 tf2 httpd: httpd startup succeeded
> Oct 25 20:26:29 tf2 clurgmgrd[4903]: <notice> Service Apache Service started
> Oct 25 20:26:36 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
> Oct 25 20:28:08 tf2 kernel: e100: eth2: e100_watchdog: link down
> Oct 25 20:28:10 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
> Oct 25 20:29:40 tf2 kernel: CMAN: node tf1 rejoining

^^ CMAN restarted on tf1 (rebooted)

> Oct 25 20:34:25 tf2 kernel: CMAN: too many transition restarts - will die
> Oct 25 20:34:25 tf2 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view

Argh. That's not good. I *think* this is a bug in CMAN-kernel in U3,
which was fixed in U4.

> Oct 25 20:34:26 tf2 kernel: lock_dlm: Assertion failed on line 428 of file
> /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
> Oct 25 20:34:26 tf2 kernel: lock_dlm: assertion: "!error"
...
> Oct 25 20:34:26 tf2 kernel: ------------[ cut here ]------------
...
> Oct 25 20:34:26 tf2 kernel: <0>Fatal exception: panic in 5 seconds
>
> and now tf2 is unreachable too..
> ideas? suggestions?

The panic above is a bug in the dlm-kernel rpm/package; I don't know
much more than that. When a machine panics, it stops responding to
things over the network.

-- Lon

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster