All,

I have a GFS 6.0 cluster with 10 client nodes and 3 dedicated lock servers; all of the servers are on AS3 U5 (give or take). The client nodes are split between two switches, and the lock servers are NIC-bonded across the two switches. The general idea is that if either switch fails, half the client nodes and all 3 lock servers stay up. During switch maintenance this past weekend, the clients on the affected switch were removed from the cluster beforehand. When the switch was taken down, the master lock server, whose primary interface is on that switch, immediately failed over its bond. Unfortunately, at that point it lost communication with all remaining nodes in the cluster:
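For reference, the bonding on the lock servers is a plain active-backup setup. The snippets below are only illustrative of how that looks on AS3; the mode/miimon values and addresses are examples, not copied verbatim from our configs:

  # /etc/modules.conf
  alias bond0 bonding
  options bond0 mode=1 miimon=100    # mode=1 = active-backup, poll MII link state every 100 ms

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  IPADDR=10.146.128.155              # e.g. server15; netmask below is assumed
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0  (ifcfg-eth1 is identical apart from DEVICE)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

Each slave NIC is cabled to a different switch, so losing one switch should only cost the bond its active or standby slave, not connectivity.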
Jul 22 01:06:26 server15 kernel: tg3: eth1: Link is down.
Jul 22 01:06:26 server15 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jul 22 01:06:26 server15 kernel: bonding: bond0: making interface eth0 the new active one.
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server14 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server01 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server07 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server05 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server20 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server03 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:38 server15 lock_gulmd_core[902]: server21 missed a heartbeat (time:1153548398431490 mb:1)
Jul 22 01:06:45 server15 lock_gulmd_core[902]: server16 missed a heartbeat (time:1153548405951416 mb:1)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server14 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server01 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server07 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server05 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server20 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server03 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:06:53 server15 lock_gulmd_core[902]: server21 missed a heartbeat (time:1153548413471310 mb:2)
Jul 22 01:07:00 server15 lock_gulmd_core[902]: server16 missed a heartbeat (time:1153548420991220 mb:2)
Jul 22 01:07:01 server15 lock_gulmd_LT000[905]: EOF on xdr (server14:10.146.128.154 idx:15 fd:20)
Jul 22 01:07:01 server15 lock_gulmd_LT001[906]: EOF on xdr (server14:10.146.128.154 idx:16 fd:21)
Jul 22 01:07:01 server15 lock_gulmd_LT002[907]: EOF on xdr (server14:10.146.128.154 idx:16 fd:21)
Jul 22 01:07:01 server15 lock_gulmd_LT001[906]: EOF on xdr (server14:10.146.128.154 idx:15 fd:20)
Jul 22 01:07:01 server15 lock_gulmd_LT002[907]: EOF on xdr (server14:10.146.128.154 idx:15 fd:20)
Jul 22 01:07:01 server15 lock_gulmd_LT004[909]: EOF on xdr (server14:10.146.128.154 idx:16 fd:21)
Jul 22 01:07:01 server15 lock_gulmd_LT004[909]: EOF on xdr (server14:10.146.128.154 idx:15 fd:20)
Jul 22 01:07:01 server15 lock_gulmd_LT003[908]: EOF on xdr (server14:10.146.128.154 idx:16 fd:21)
Jul 22 01:07:01 server15 lock_gulmd_LT003[908]: EOF on xdr (server14:10.146.128.154 idx:15 fd:20)
Jul 22 01:07:01 server15 lock_gulmd_LT000[905]: EOF on xdr (server14:10.146.128.154 idx:16 fd:21)
Jul 22 01:07:02 server15 lock_gulmd_core[902]: (10.146.128.154:server14) Cannot login if you are expired.
Jul 22 01:07:02 server15 lock_gulmd_core[902]: (10.146.128.130:server01) Cannot login if you are expired.
Jul 22 01:07:03 server15 lock_gulmd_core[902]: (10.146.128.136:server07) Cannot login if you are expired.
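We are running what I believe are the stock gulm heartbeat settings. For context, the lock_gulm stanza in cluster.ccs looks roughly like the sketch below; the names are placeholders and the two tunables shown are, as far as I know, the documented defaults, which would match the roughly 15-second spacing between the mb:1 and mb:2 misses above:

  # cluster.ccs (illustrative sketch, not our actual file)
  cluster {
      name = "ourcluster"                        # placeholder cluster name
      lock_gulm {
          servers = ["lock1", "lock2", "lock3"]  # placeholders for the 3 dedicated lock servers
          heartbeat_rate = 15                    # seconds between heartbeats (default)
          allowed_misses = 2                     # misses tolerated before a node is marked expired (default)
      }
  }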
At this point the other servers in the cluster are reporting:

Jul 22 01:06:31 server01 lock_gulmd_core[7513]: Failed to receive a timely heartbeat reply from Master. (t:1153548391559940 mb:1)
Jul 22 01:06:46 server01 lock_gulmd_core[7513]: Failed to receive a timely heartbeat reply from Master. (t:1153548406560307 mb:2)
Jul 22 01:07:01 server01 lock_gulmd_core[7513]: Failed to receive a timely heartbeat reply from Master. (t:1153548421560674 mb:3)
Jul 22 01:07:02 server01 lock_gulmd_core[7513]: ERROR [core_io.c:1084] Got error from reply: (server15:10.146.128.155) 1008:Bad State Change
Jul 22 01:07:35 server01 last message repeated 11 times
Jul 22 01:07:56 server01 last message repeated 7 times
Jul 22 01:07:59 server01 lock_gulmd_core[7513]: ERROR [core_io.c:1084] Got error from reply: (server15:10.146.128.155) 1008:Bad State Change
Jul 22 01:08:32 server01 last message repeated 11 times
Jul 22 01:08:35 server01 lock_gulmd_core[7513]: ERROR [core_io.c:1084] Got error from reply: (server15:10.146.128.155) 1008:Bad State Change
Jul 22 01:08:38 server01 lock_gulmd_core[7513]: ERROR [core_io.c:1084] Got error from reply: (server15:10.146.128.155) 1008:Bad State Change
Jul 22 01:08:56 server01 last message repeated 6 times
Jul 22 01:08:59 server01 lock_gulmd_core[7513]: ERROR [core_io.c:1084] Got error from reply: (server15:10.146.128.155) 1008:Bad State Change
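For reference, this is the sort of thing I would check on the master's bond right after a failover like this, before digging into gulm itself. These are standard tools, nothing gulm-specific, and I'm assuming the iputils arping on these AS3 boxes supports -U:

  # Show which slave the bond considers active and the MII status of each link
  cat /proc/net/bonding/bond0

  # Confirm the bond can actually reach another cluster member (server01, per the logs above)
  ping -c 3 -I bond0 10.146.128.130

  # Send unsolicited/gratuitous ARPs for the bond's own address (server15 here)
  # so both switches relearn which port the bond's MAC now sits behind
  arping -U -c 3 -I bond0 10.146.128.155

The gratuitous ARP step is there because, with active-backup bonding, the rest of the network only sees the bond on its new port once the switches relearn the MAC.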
Why am I seeing “

Thanks,
Britt Treece

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster