Fwd: [Pacemaker] One node thinks everyone is online, but the other node doesn't think so

Dmitry Koterov <dmitry.koterov@xxxxxxxxx> · Tue, 10 Mar 2015 18:27:52 +0300

On Tue, Feb 24, 2015 at 2:07 AM, Andrew Beekhof <andrew@xxxxxxxxxxx> wrote:

> I have a 3-node cluster where node1 and node2 are running corosync+pacemaker and node3 is running corosync only (for quorum). Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine the first couple of days.

>

> Once upon a time I discovered the following situation: node2 thinks that both node1 and node2 are online, but node1 thinks that node2 is down. Could you please say: how could it be? There are no connectivity problems between the nodes at the moment (maybe they were, but why the system hasn't recovered?).

The logs show connectivity problems occurring, so no doubt there.

As to why it hasn't recovered, first check corosync - if it does not have a consistent view of the world pacemaker has no hope.

Alternatively, I recall there was a bug that could be preventing this in your version.  So if corosync looks fine, perhaps try an upgrade.

Thanks.
Are you talking about this bug: https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496 ?

I believe I reproduced the problem one more time (it's very unstable), the symptoms were the following:

1. Once upon a time node2 became down.
2. The last message from corosync at node1 was "Quorum lost" (I suspect there was a temporary misconnection with node3).
3. Then, in a couple of days, at node3 "service corosync stop" hanged (only killall -9 helps). I tried to run strace during the service is stopping, it shows:

[pid 19449] futex(0x7f580b4c62e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 19448] --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=28183, si_uid=0} ---
[pid 19448] write(6, "\3\0\0\0", 4)     = 4
[pid 19448] rt_sigreturn()              = 360
... <and repeats for 19448 again and again>

where pstree shows:

init,1
  ├─corosync,19448
  │   └─{corosync},19449

4. As well as at node1: "service corosync stop" hangs at node1 too with same symptoms, only killall -9 helps.
5. Restarting corosync & pacemaker at node1 and node2 solved the problem.

Could you please say is it related to the above bug in libqb?

> The "crm status" is below. What other logs should I attach for the diagnostics?

>

> Also, "service corosync stop" on node1 hangs forever with no additional lines in logs, so I cannot even stop the service. (But after "service corosync stop" on node1 the node node2 starts thinking that node1 is offline, although the command still hangs.)

>

>

> root@node2:~# crm status

> Current DC: node1 (1760315215) - partition with quorum

> 2 Nodes configured

> 6 Resources configured

> Online: [ node1 node2 ]

> Master/Slave Set: ms_drbd [drbd]

>      Masters: [ node2 ]

>      Slaves: [ node1 ]

> Resource Group: server

>      fs       (ocf::heartbeat:Filesystem):    Started node2

>      postgresql       (lsb:postgresql):       Started node2

>      bind9    (lsb:bind9):    Started node2

>      nginx    (lsb:nginx):    Started node2

>

>

> root@node1:/var/log# crm status

> Current DC: node1 (1760315215) - partition with quorum

> 2 Nodes configured

> 6 Resources configured

> Online: [ node1 ]

> OFFLINE: [ node2 ]

> Master/Slave Set: ms_drbd [drbd]

>      Masters: [ node1 ]

>      Stopped: [ node2 ]

> Resource Group: server

>      fs       (ocf::heartbeat:Filesystem):    Started node1

>      postgresql       (lsb:postgresql):       Started node1

>      bind9    (lsb:bind9):    Started node1

>      nginx    (lsb:nginx):    Started node1

> Failed actions:

>     drbd_promote_0 (node=node1, call=634, rc=1, status=Timed Out, last-rc-change=Thu Jan 22 10:30:08 2015, queued=20004ms, exec=0ms): unknown error

>

>

> A part of "crm configure show":

>

> property $id="cib-bootstrap-options" \

>         dc-version="1.1.10-42f2063" \

>         cluster-infrastructure="corosync" \

>         stonith-enabled="false" \

>         last-lrm-refresh="1421250983"

> rsc_defaults $id="rsc-options" \

>         resource-stickiness="100"

>

>

> Also I see in logs on node1 (maybe they're related to the issue, maybe not):

>

> Jan 22 10:14:02 node1 pengine[2772]:  warning: pe_fence_node: Node node2 is unclean because it is partially and/or un-expectedly down

> Jan 22 10:14:02 node1 pengine[2772]:  warning: determine_online_status: Node node2 is unclean

> Jan 22 10:14:02 node1 pengine[2772]:  warning: stage6: Node node2 is unclean!

> Jan 22 10:14:02 node1 pengine[2772]:  warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED

> Jan 22 10:14:02 node1 pengine[2772]:    error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE

>

>

> On node2 the logs are:

>

> Jan 22 10:13:57 node2 corosync[32761]:  [TOTEM ] A new membership (188.166.54.190:6276) was formed. Members left: 1760315215 13071578

> Jan 22 10:13:57 node2 crmd[311]:   notice: peer_update_callback: Our peer on the DC is dead

> Jan 22 10:13:57 node2 crmd[311]:   notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]

> Jan 22 10:13:57 node2 corosync[32761]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.

> Jan 22 10:13:57 node2 corosync[32761]:  [QUORUM] Members[1]: 1017525950

> Jan 22 10:13:57 node2 crmd[311]:   notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)

> Jan 22 10:13:57 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)

> Jan 22 10:13:57 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)

> Jan 22 10:13:57 node2 pacemakerd[302]:   notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)

> Jan 22 10:13:57 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)

> Jan 22 10:13:57 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)

> Jan 22 10:13:57 node2 corosync[32761]:  [MAIN  ] Completed service synchronization, ready to provide service.

> Jan 22 10:14:01 node2 corosync[32761]:  [TOTEM ] A new membership (104.236.71.79:6288) was formed. Members joined: 1760315215 13071578

> Jan 22 10:14:02 node2 crmd[311]:    error: pcmk_cpg_membership: Node node1[1760315215] appears to be online even though we think it is dead

> Jan 22 10:14:02 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_cpg_membership: Node node1[1760315215] - state is now member (was lost)

> Jan 22 10:14:03 node2 corosync[32761]:  [QUORUM] This node is within the primary component and will provide service.

> Jan 22 10:14:03 node2 corosync[32761]:  [QUORUM] Members[3]: 1760315215 13071578 1017525950

> Jan 22 10:14:03 node2 crmd[311]:   notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)

> Jan 22 10:14:03 node2 pacemakerd[302]:   notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)

> Jan 22 10:14:03 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now member (was lost)

> Jan 22 10:14:03 node2 corosync[32761]:  [MAIN  ] Completed service synchronization, ready to provide service.

> Jan 22 10:14:03 node2 crmd[311]:   notice: corosync_node_name: Unable to get node name for nodeid 13071578

> Jan 22 10:14:03 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)

> Jan 22 10:14:03 node2 pacemakerd[302]:   notice: corosync_node_name: Unable to get node name for nodeid 13071578

> Jan 22 10:14:03 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)

> Jan 22 10:14:03 node2 crmd[311]:  warning: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION

> Jan 22 10:14:04 node2 crmd[311]:   notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]

> Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)

> Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd (10000)

> Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)

> Jan 22 10:14:05 node2 crmd[311]:   notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]

> Jan 22 10:15:11 node2 corosync[32761]:  [TOTEM ] A new membership (104.236.71.79:6296) was formed. Members left: 13071578

> Jan 22 10:15:14 node2 corosync[32761]:  [TOTEM ] A new membership (128.199.116.218:6312) was formed. Members joined: 13071578 left: 1760315215

> Jan 22 10:15:17 node2 corosync[32761]:  [TOTEM ] A new membership (104.236.71.79:6324) was formed. Members joined: 1760315215

> Jan 22 10:15:19 node2 crmd[311]:   notice: peer_update_callback: Our peer on the DC is dead

> Jan 22 10:15:19 node2 crmd[311]:   notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]

> Jan 22 10:15:20 node2 kernel: [690741.179442] block drbd0: peer( Primary -> Secondary )

> Jan 22 10:15:20 node2 corosync[32761]:  [QUORUM] Members[3]: 1760315215 13071578 1017525950

> Jan 22 10:15:20 node2 corosync[32761]:  [MAIN  ] Completed service synchronization, ready to provide service.

>

> _______________________________________________

> Pacemaker mailing list: Pacemaker@xxxxxxxxxxxxxxxxxxx

> http://oss.clusterlabs.org/mailman/listinfo/pacemaker

>

> Project Home: http://www.clusterlabs.org

> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

> Bugs: http://bugs.clusterlabs.org

_______________________________________________

Pacemaker mailing list: Pacemaker@xxxxxxxxxxxxxxxxxxx

http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org

Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: http://bugs.clusterlabs.org

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss