GFS 6.0 Bad State Change

All,

 

I have a GFS 6.0 cluster with 10 client nodes and 3 dedicated lock servers, all running roughly AS3 U5.  The client nodes are split between two switches, and the lock servers are NIC-bonded across both switches, the idea being that if either switch fails, half of the client nodes and all 3 lock servers stay up.  During switch maintenance this past weekend, the clients on the affected switch were removed from the cluster beforehand.  When the switch was taken down, the master lock server, whose primary interface is on that switch, immediately failed over its bond.  Unfortunately, at that point it lost communication with all remaining nodes in the cluster…
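
The bond failover itself looks like plain active-backup (the "making interface eth0 the new active one" message below).  In case the details matter, here is a rough sketch of that kind of bonding setup on AS3; the miimon value, netmask, and the eth1 stanza are illustrative rather than copied from the boxes:

# /etc/modules.conf -- load the bonding driver in active-backup mode,
# checking link state via MII every 100 ms
alias bond0 bonding
options bond0 mode=1 miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- the bonded interface
# carrying the lock server's address (server15's IP from the logs)
DEVICE=bond0
IPADDR=10.146.128.155
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 -- enslaved to bond0
# (ifcfg-eth1 is identical apart from DEVICE=eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none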

 

Jul 22 01:06:26 server15 kernel: tg3: eth1: Link is down.
Jul 22 01:06:26 server15 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jul 22 01:06:26 server15 kernel: bonding: bond0: making interface eth0 the new active one.
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server14 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server01 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server07 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server05 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server20 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server03 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server21 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:45 server15 lock_gulmd_core[902]: server16 missed a heartbeat (time:1153548405951416 mb:1)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server14 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server01 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server07 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server05 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server20 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server03 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server21 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:07:00 server15 lock_gulmd_core[902]: server16 missed a heartbeat (time:1153548420991220 mb:2)
Jul 22 01:07:01 server15 lock_gulmd_LT000[905]: EOF on xdr (server14:10.146.128.154 idx:15 fd:20)
Jul 22 01:07:01 server15 lock_gulmd_LT001[906]: EOF on xdr (server14:10.146.128.154 idx:16 fd:21)
Jul 22 01:07:01 server15 lock_gulmd_LT002[907]: EOF on xdr (server14:10.146.128.154 idx:16 fd:21)
Jul 22 01:07:01 server15 lock_gulmd_LT001[906]: EOF on xdr (server14:10.146.128.154 idx:15 fd:20)
Jul 22 01:07:01 server15 lock_gulmd_LT002[907]: EOF on xdr (server14:10.146.128.154 idx:15 fd:20)
Jul 22 01:07:01 server15 lock_gulmd_LT004[909]: EOF on xdr (server14:10.146.128.154 idx:16 fd:21)
Jul 22 01:07:01 server15 lock_gulmd_LT004[909]: EOF on xdr (server14:10.146.128.154 idx:15 fd:20)
Jul 22 01:07:01 server15 lock_gulmd_LT003[908]: EOF on xdr (server14:10.146.128.154 idx:16 fd:21)
Jul 22 01:07:01 server15 lock_gulmd_LT003[908]: EOF on xdr (server14:10.146.128.154 idx:15 fd:20)
Jul 22 01:07:01 server15 lock_gulmd_LT000[905]: EOF on xdr (server14:10.146.128.154 idx:16 fd:21)
Jul 22 01:07:02 server15 lock_gulmd_core[902]:  (10.146.128.154:server14) Cannot login if you are expired.
Jul 22 01:07:02 server15 lock_gulmd_core[902]:  (10.146.128.130:server01) Cannot login if you are expired.
Jul 22 01:07:03 server15 lock_gulmd_core[902]:  (10.146.128.136:server07) Cannot login if you are expired.

 

At this point the other servers in the cluster are reporting…

 

Jul 22 01:06:31 server01 lock_gulmd_core[7513]: Failed to receive a timely heartbeat reply from Master. (t:1153548391559940 mb:1)
Jul 22 01:06:46 server01 lock_gulmd_core[7513]: Failed to receive a timely heartbeat reply from Master. (t:1153548406560307 mb:2)
Jul 22 01:07:01 server01 lock_gulmd_core[7513]: Failed to receive a timely heartbeat reply from Master. (t:1153548421560674 mb:3)
Jul 22 01:07:02 server01 lock_gulmd_core[7513]: ERROR [core_io.c:1084] Got error from reply: (server15:10.146.128.155) 1008:Bad State Change
Jul 22 01:07:35 server01 last message repeated 11 times
Jul 22 01:07:56 server01 last message repeated 7 times
Jul 22 01:07:59 server01 lock_gulmd_core[7513]: ERROR [core_io.c:1084] Got error from reply: (server15:10.146.128.155) 1008:Bad State Change
Jul 22 01:08:32 server01 last message repeated 11 times
Jul 22 01:08:35 server01 lock_gulmd_core[7513]: ERROR [core_io.c:1084] Got error from reply: (server15:10.146.128.155) 1008:Bad State Change
Jul 22 01:08:38 server01 lock_gulmd_core[7513]: ERROR [core_io.c:1084] Got error from reply: (server15:10.146.128.155) 1008:Bad State Change
Jul 22 01:08:56 server01 last message repeated 6 times
Jul 22 01:08:59 server01 lock_gulmd_core[7513]: ERROR [core_io.c:1084] Got error from reply: (server15:10.146.128.155) 1008:Bad State Change

 

 

Why am I seeing “Bad State Change”?  Does anyone have any successful experience running GFS with NIC bonding?
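
For reference, my understanding is that the gulm heartbeat window comes from the lock_gulm section of cluster.ccs, something along these lines; the cluster name and the two other lock server names are placeholders, and the timing values are what I believe the defaults to be rather than necessarily what we run:

cluster {
        name = "mycluster"
        lock_gulm {
                servers = ["server15", "serverA", "serverB"]
                heartbeat_rate = 15.0
                allowed_misses = 2
        }
}

Assuming those defaults, the expirations at 01:07:02 line up with two to three missed 15-second heartbeats after the bond failed over at 01:06:26.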

 

 

 

Thanks,

 

Britt Treece

--

Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
