Cluster crashing randomly after a few hours of use.

Here is the stack of errors from when it happened. All I had been doing was playing with the nodes: placing them in standby, then moving them back online. When the crash occurred I was not working on them; I was away from my desk (no interaction at the moment of the crash).
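For reference, the standby toggling was nothing more than the standard commands, roughly like this (crm shell syntax assumed; crm_standby would do the same):

    # put a node into standby, then bring it back online
    crm node standby cluster_02
    crm node online cluster_02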

Here is my corosync.log:

Jun 07 10:47:57 cluster_02 cib: [29308]: debug: cib_process_xpath: Processing cib_query op for //cib/configuration/nodes//node[@id='cluster_02']//instance_attributes//nvpair[@name='standby'] (/cib/configuration/nodes/node[2]/instance_attributes/nvpair)
Jun 07 10:47:57 corosync [TOTEM ] mcasted message added to pending queue
Jun 07 10:47:57 corosync [TOTEM ] Delivering 62 to 63
Jun 07 10:47:57 corosync [TOTEM ] Delivering MCAST message with seq 63 to pending delivery queue
Jun 07 10:47:57 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 63
Jun 07 10:47:57 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 64
Jun 07 10:47:57 corosync [TOTEM ] Delivering 63 to 64
Jun 07 10:47:57 corosync [TOTEM ] Delivering MCAST message with seq 64 to pending delivery queue
Jun 07 10:47:57 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 65
Jun 07 10:47:57 corosync [TOTEM ] Delivering 64 to 65
Jun 07 10:47:57 corosync [TOTEM ] Delivering MCAST message with seq 65 to pending delivery queue
Jun 07 10:47:57 corosync [TOTEM ] releasing messages up to and including 63
Jun 07 10:47:57 corosync [TOTEM ] releasing messages up to and including 65
Jun 07 10:53:12 cluster_02 cib: [29308]: info: cib_stats: Processed 5 operations (0.00us average, 0% utilization) in the last 10min
Jun 07 10:53:12 cluster_02 cib: [29308]: debug: cib_stats: Detail: 69 operations (0ms total) (63 local, 31 updates, 0 failures, 0 timeouts, 0 bad connects)
Jun 07 11:00:42 cluster_02 cib: [29308]: debug: cib_process_xpath: Processing cib_query op for //cib/configuration/nodes//node[@id='cluster_02']//instance_attributes//nvpair[@name='standby'] (/cib/configuration/nodes/node[2]/instance_attributes/nvpair)
Jun 07 11:00:42 corosync [TOTEM ] mcasted message added to pending queue
Jun 07 11:00:42 corosync [TOTEM ] Delivering 65 to 66
Jun 07 11:00:42 corosync [TOTEM ] Delivering MCAST message with seq 66 to pending delivery queue
Jun 07 11:00:42 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 66
Jun 07 11:00:42 corosync [TOTEM ] releasing messages up to and including 66
Jun 07 11:00:42 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 67
Jun 07 11:00:42 corosync [TOTEM ] Delivering 66 to 67
Jun 07 11:00:42 corosync [TOTEM ] Delivering MCAST message with seq 67 to pending delivery queue
Jun 07 11:00:42 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 68
Jun 07 11:00:42 corosync [TOTEM ] Delivering 67 to 68
Jun 07 11:00:42 corosync [TOTEM ] Delivering MCAST message with seq 68 to pending delivery queue
Jun 07 11:00:42 corosync [TOTEM ] releasing messages up to and including 68
Jun 07 11:01:01 cluster_02 stonith-ng: [29307]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
Jun 07 11:01:01 cluster_02 attrd: [29311]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
Jun 07 11:01:01 cluster_02 attrd: [29311]: notice: main: Exiting...
Jun 07 11:01:01 cluster_02 attrd: [29311]: debug: cib_native_signoff: Signing out of the CIB Service
Jun 07 11:01:01 cluster_02 attrd: [29311]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
Jun 07 11:01:01 cluster_02 cib: [29308]: ERROR: cib_ais_destroy: AIS connection terminated
Jun 07 11:01:01 corosync [CPG   ] exit_fn for conn=0x8d01e60
Jun 07 11:01:01 corosync [pcmk ] info: pcmk_ipc_exit: Client stonith-ng (conn=0x8d06038, async-conn=0x8d06038) left
Jun 07 11:01:01 corosync [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x8d0a210, async-conn=0x8d0a210) left
Jun 07 11:01:01 corosync [TOTEM ] mcasted message added to pending queue
Jun 07 11:01:01 corosync [TOTEM ] Delivering 68 to 69
Jun 07 11:01:01 corosync [TOTEM ] Delivering MCAST message with seq 69 to pending delivery queue
Jun 07 11:01:01 corosync [CPG   ] got procleave message from cluster node 302516746
Jun 07 11:01:01 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 69
Jun 07 11:01:01 corosync [pcmk ] info: pcmk_ipc_exit: Client cib (conn=0x8d0e3e8, async-conn=0x8d0e3e8) left
Jun 07 11:01:01 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 6a
Jun 07 11:01:01 corosync [TOTEM ] Delivering 69 to 6a
Jun 07 11:01:01 corosync [TOTEM ] Delivering MCAST message with seq 6a to pending delivery queue
Jun 07 11:01:01 corosync [TOTEM ] releasing messages up to and including 69
Jun 07 11:01:01 corosync [TOTEM ] releasing messages up to and including 6a
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: xmlfromIPC: Peer disconnected
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: cib_native_msgready: Lost connection to the CIB service [29308].
Jun 07 11:01:01 cluster_02 crmd: [29313]: CRIT: cib_native_dispatch: Lost connection to the CIB service [29308/callback].
Jun 07 11:01:01 cluster_02 crmd: [29313]: CRIT: cib_native_dispatch: Lost connection to the CIB service [29308/command].
Jun 07 11:01:01 cluster_02 crmd: [29313]: ERROR: crmd_cib_connection_destroy: Connection to the CIB terminated...
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: crmd_ais_destroy: connection closed
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: xmlfromIPC: Peer disconnected
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: stonith_msgready: Lost connection to the STONITH service [29307].
Jun 07 11:01:01 cluster_02 crmd: [29313]: CRIT: stonith_dispatch_internal: Lost connection to the STONITH service [29307/callback].
Jun 07 11:01:01 cluster_02 crmd: [29313]: CRIT: stonith_dispatch_internal: Lost connection to the STONITH service [29307/command].
Jun 07 11:01:01 cluster_02 crmd: [29313]: CRIT: tengine_stonith_connection_destroy: Fencing daemon connection failed
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: s_crmd_fsa: Processing I_ERROR: [ state=S_NOT_DC cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_ERROR
Jun 07 11:01:01 cluster_02 crmd: [29313]: ERROR: do_log: FSA: Input I_ERROR from crmd_cib_connection_destroy() received in state S_NOT_DC
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_DC_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_INTEGRATE_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_FINALIZE_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_RECOVER
Jun 07 11:01:01 cluster_02 crmd: [29313]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: s_crmd_fsa: Processing I_TERMINATE: [ state=S_RECOVERY cause=C_FSA_INTERNAL origin=do_recover ]
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_ERROR
Jun 07 11:01:01 cluster_02 crmd: [29313]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_DC_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_INTEGRATE_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_FINALIZE_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_SHUTDOWN
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_shutdown: Disconnecting STONITH...
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: stonith_api_signoff: Signing out of the STONITH Service
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_LRM_DISCONNECT
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: verify_stopped: Checking for active resources before exit
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_lrm_control: Disconnected from the LRM
Jun 07 11:01:01 cluster_02 lrmd: [29310]: debug: on_receive_cmd: the IPC to client [pid:29313] disconnected.
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_CCM_DISCONNECT
Jun 07 11:01:01 cluster_02 lrmd: [29310]: debug: unregister_client: client crmd [pid:29313] is unregistered
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_HA_DISCONNECT
Jun 07 11:01:01 cluster_02 crmd: [29313]: notice: terminate_ais_connection: Disconnecting from AIS
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_ha_control: Disconnected from OpenAIS
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_CIB_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_cib_control: Disconnecting CIB
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: cib_client_del_notify_callback: Removing callback for cib_diff_notify events
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_EXIT_0
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: verify_stopped: Checking for active resources before exit
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Jun 07 11:01:01 cluster_02 crmd: [29313]: ERROR: do_exit: Could not recover from internal error
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: free_mem: Number of connected clients: 0
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: free_mem: Partial destroy: TE
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: free_mem: Partial destroy: PE
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_exit: [crmd] stopped (2)


I am aware that stonith is not configured at the moment; the reason is that I have not put in any configuration yet and am simply running the two servers with no load or resources, just to test the cluster itself. It is the slave machine that dies.
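(For a test setup without fencing devices, the property would normally be set explicitly, e.g. with the crm shell:

    # explicitly disable fencing while no stonith devices exist (testing only)
    crm configure property stonith-enabled=false

but I have left even that at the defaults for now.)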

This is my corosync.conf:

compatibility: whitetank

totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 3600
    vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: none
    interface {
        ringnumber: 0
        bindnetaddr: 10.10.0.0
        mcastaddr: 226.18.1.1
        mcastport: 6006
    }
}
service {
    ver:    1
    name:    pacemaker
}
aisexec {
    user: root
    group: root
}
logging {
    fileline: off
    to_stderr: yes
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: on
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: on
    }
}
amf {
    mode: disabled
}
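
If more diagnostic output would help, I can collect the ring status and membership from both nodes with the usual corosync 1.x tools, for example:

    # ring status for ringnumber 0
    corosync-cfgtool -s
    # dump the runtime object database and pick out the membership entries
    corosync-objctl | grep member

Just say what else you need.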

Thanks in advance!

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

