Cluster crashing randomly after a few hours of use.

Here is the stack of errors from when it happened. All I had been doing was playing with the nodes: placing them in standby, then moving them back online. When the crash occurred I was not working on them; I was away from my desk (no interaction at the moment of the crash).
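For reference, the standby toggling was nothing more than the standard commands, roughly like this (crm shell syntax assumed; crm_standby would do the same):

    # put a node into standby, then bring it back online
    crm node standby cluster_02
    crm node online cluster_02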

Here is my corosync.log:

Jun 07 10:47:57 cluster_02 cib: [29308]: debug: cib_process_xpath: Processing cib_query op for //cib/configuration/nodes//node[@id='cluster_02']//instance_attributes//nvpair[@name='standby'] (/cib/configuration/nodes/node[2]/instance_attributes/nvpair)
Jun 07 10:47:57 corosync [TOTEM ] mcasted message added to pending queue
Jun 07 10:47:57 corosync [TOTEM ] Delivering 62 to 63
Jun 07 10:47:57 corosync [TOTEM ] Delivering MCAST message with seq 63 to pending delivery queue
Jun 07 10:47:57 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 63
Jun 07 10:47:57 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 64
Jun 07 10:47:57 corosync [TOTEM ] Delivering 63 to 64
Jun 07 10:47:57 corosync [TOTEM ] Delivering MCAST message with seq 64 to pending delivery queue
Jun 07 10:47:57 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 65
Jun 07 10:47:57 corosync [TOTEM ] Delivering 64 to 65
Jun 07 10:47:57 corosync [TOTEM ] Delivering MCAST message with seq 65 to pending delivery queue
Jun 07 10:47:57 corosync [TOTEM ] releasing messages up to and including 63
Jun 07 10:47:57 corosync [TOTEM ] releasing messages up to and including 65
Jun 07 10:53:12 cluster_02 cib: [29308]: info: cib_stats: Processed 5 operations (0.00us average, 0% utilization) in the last 10min
Jun 07 10:53:12 cluster_02 cib: [29308]: debug: cib_stats: Detail: 69 operations (0ms total) (63 local, 31 updates, 0 failures, 0 timeouts, 0 bad connects)
Jun 07 11:00:42 cluster_02 cib: [29308]: debug: cib_process_xpath: Processing cib_query op for //cib/configuration/nodes//node[@id='cluster_02']//instance_attributes//nvpair[@name='standby'] (/cib/configuration/nodes/node[2]/instance_attributes/nvpair)
Jun 07 11:00:42 corosync [TOTEM ] mcasted message added to pending queue
Jun 07 11:00:42 corosync [TOTEM ] Delivering 65 to 66
Jun 07 11:00:42 corosync [TOTEM ] Delivering MCAST message with seq 66 to pending delivery queue
Jun 07 11:00:42 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 66
Jun 07 11:00:42 corosync [TOTEM ] releasing messages up to and including 66
Jun 07 11:00:42 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 67
Jun 07 11:00:42 corosync [TOTEM ] Delivering 66 to 67
Jun 07 11:00:42 corosync [TOTEM ] Delivering MCAST message with seq 67 to pending delivery queue
Jun 07 11:00:42 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 68
Jun 07 11:00:42 corosync [TOTEM ] Delivering 67 to 68
Jun 07 11:00:42 corosync [TOTEM ] Delivering MCAST message with seq 68 to pending delivery queue
Jun 07 11:00:42 corosync [TOTEM ] releasing messages up to and including 68
Jun 07 11:01:01 cluster_02 stonith-ng: [29307]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
Jun 07 11:01:01 cluster_02 attrd: [29311]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
Jun 07 11:01:01 cluster_02 attrd: [29311]: notice: main: Exiting...
Jun 07 11:01:01 cluster_02 attrd: [29311]: debug: cib_native_signoff: Signing out of the CIB Service
Jun 07 11:01:01 cluster_02 attrd: [29311]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
Jun 07 11:01:01 cluster_02 cib: [29308]: ERROR: cib_ais_destroy: AIS connection terminated
Jun 07 11:01:01 corosync [CPG   ] exit_fn for conn=0x8d01e60
Jun 07 11:01:01 corosync [pcmk ] info: pcmk_ipc_exit: Client stonith-ng (conn=0x8d06038, async-conn=0x8d06038) left
Jun 07 11:01:01 corosync [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x8d0a210, async-conn=0x8d0a210) left
Jun 07 11:01:01 corosync [TOTEM ] mcasted message added to pending queue
Jun 07 11:01:01 corosync [TOTEM ] Delivering 68 to 69
Jun 07 11:01:01 corosync [TOTEM ] Delivering MCAST message with seq 69 to pending delivery queue
Jun 07 11:01:01 corosync [CPG   ] got procleave message from cluster node 302516746
Jun 07 11:01:01 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 69
Jun 07 11:01:01 corosync [pcmk ] info: pcmk_ipc_exit: Client cib (conn=0x8d0e3e8, async-conn=0x8d0e3e8) left
Jun 07 11:01:01 corosync [TOTEM ] Received ringid(10.10.8.17:1684) seq 6a
Jun 07 11:01:01 corosync [TOTEM ] Delivering 69 to 6a
Jun 07 11:01:01 corosync [TOTEM ] Delivering MCAST message with seq 6a to pending delivery queue
Jun 07 11:01:01 corosync [TOTEM ] releasing messages up to and including 69
Jun 07 11:01:01 corosync [TOTEM ] releasing messages up to and including 6a
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: xmlfromIPC: Peer disconnected
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: cib_native_msgready: Lost connection to the CIB service [29308].
Jun 07 11:01:01 cluster_02 crmd: [29313]: CRIT: cib_native_dispatch: Lost connection to the CIB service [29308/callback].
Jun 07 11:01:01 cluster_02 crmd: [29313]: CRIT: cib_native_dispatch: Lost connection to the CIB service [29308/command].
Jun 07 11:01:01 cluster_02 crmd: [29313]: ERROR: crmd_cib_connection_destroy: Connection to the CIB terminated...
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: crmd_ais_destroy: connection closed
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: xmlfromIPC: Peer disconnected
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: stonith_msgready: Lost connection to the STONITH service [29307].
Jun 07 11:01:01 cluster_02 crmd: [29313]: CRIT: stonith_dispatch_internal: Lost connection to the STONITH service [29307/callback].
Jun 07 11:01:01 cluster_02 crmd: [29313]: CRIT: stonith_dispatch_internal: Lost connection to the STONITH service [29307/command].
Jun 07 11:01:01 cluster_02 crmd: [29313]: CRIT: tengine_stonith_connection_destroy: Fencing daemon connection failed
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: s_crmd_fsa: Processing I_ERROR: [ state=S_NOT_DC cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_ERROR
Jun 07 11:01:01 cluster_02 crmd: [29313]: ERROR: do_log: FSA: Input I_ERROR from crmd_cib_connection_destroy() received in state S_NOT_DC
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_DC_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_INTEGRATE_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_FINALIZE_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_RECOVER
Jun 07 11:01:01 cluster_02 crmd: [29313]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: s_crmd_fsa: Processing I_TERMINATE: [ state=S_RECOVERY cause=C_FSA_INTERNAL origin=do_recover ]
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_ERROR
Jun 07 11:01:01 cluster_02 crmd: [29313]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_DC_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_INTEGRATE_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_FINALIZE_TIMER_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_SHUTDOWN
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_shutdown: Disconnecting STONITH...
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: stonith_api_signoff: Signing out of the STONITH Service
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_LRM_DISCONNECT
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: verify_stopped: Checking for active resources before exit
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_lrm_control: Disconnected from the LRM
Jun 07 11:01:01 cluster_02 lrmd: [29310]: debug: on_receive_cmd: the IPC to client [pid:29313] disconnected.
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_CCM_DISCONNECT
Jun 07 11:01:01 cluster_02 lrmd: [29310]: debug: unregister_client: client crmd [pid:29313] is unregistered
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_HA_DISCONNECT
Jun 07 11:01:01 cluster_02 crmd: [29313]: notice: terminate_ais_connection: Disconnecting from AIS
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_ha_control: Disconnected from OpenAIS
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_CIB_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_cib_control: Disconnecting CIB
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: cib_client_del_notify_callback: Removing callback for cib_diff_notify events
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_STOP
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: do_fsa_action: actions:trace: // A_EXIT_0
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: verify_stopped: Checking for active resources before exit
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Jun 07 11:01:01 cluster_02 crmd: [29313]: ERROR: do_exit: Could not recover from internal error
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: free_mem: Number of connected clients: 0
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: free_mem: Partial destroy: TE
Jun 07 11:01:01 cluster_02 crmd: [29313]: debug: free_mem: Partial destroy: PE
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jun 07 11:01:01 cluster_02 crmd: [29313]: info: do_exit: [crmd] stopped (2)


I am aware that stonith is not configured at the moment; the reason is that I have not put in any configuration yet and am simply running the two servers with no load or resources, just to test the cluster itself. It is the slave machine that dies.
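(For a test setup without fencing devices, the property would normally be set explicitly, e.g. with the crm shell:

    # explicitly disable fencing while no stonith devices exist (testing only)
    crm configure property stonith-enabled=false

but I have left even that at the defaults for now.)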

This is my corosync.conf:

compatibility: whitetank

totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 3600
    vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: none
    interface {
        ringnumber: 0
        bindnetaddr: 10.10.0.0
        mcastaddr: 226.18.1.1
        mcastport: 6006
    }
}
service {
    ver:    1
    name:    pacemaker
}
aisexec {
    user: root
    group: root
}
logging {
    fileline: off
    to_stderr: yes
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: on
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: on
    }
}
amf {
    mode: disabled
}
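
If more diagnostic output would help, I can collect the ring status and membership from both nodes with the usual corosync 1.x tools, for example:

    # ring status for ringnumber 0
    corosync-cfgtool -s
    # dump the runtime object database and pick out the membership entries
    corosync-objctl | grep member

Just say what else you need.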

Thanks in advance!

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

