I didn't read the whole thing, but if you're trying to do HA NFS, you need to run OCFS2 on your RBD and disable read/write caching on the RBD client.
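Roughly, and only as a sketch (the device path and label below are placeholders, and the o2cb cluster configuration that OCFS2 needs is a separate step not shown here):

# /etc/ceph/ceph.conf on both NFS server nodes: turn off librbd client caching
[client]
    rbd cache = false

# format the shared, mapped RBD with OCFS2, one node slot per NFS server
mkfs.ocfs2 -N 2 -L nfs_share /dev/rbd0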
From: "Steve Anthony" <sma310@xxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Sent: Friday, December 25, 2015 12:39:01 AM
Subject: Re: nfs over rbd problem
I've run into many problems trying to run RBD/NFS under Pacemaker as you describe on a two-node cluster. In my case, most of the problems were a result of a) no quorum and b) no STONITH. If you're going to be running this setup in production, I *highly* recommend adding more nodes (if only to maintain quorum). Without quorum, you will run into split-brain situations where you'll see anything from both nodes starting the same services to configuration changes disappearing when the nodes can't agree on a recent revision.
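If a full extra node isn't practical, even a small vote-only member helps. As an illustration only (your stack is the classic corosync 1.x plugin, so the syntax differs; this is the corosync 2.x/votequorum form, and the node names are placeholders):

# /etc/corosync/corosync.conf
nodelist {
    node {
        ring0_addr: node1
        nodeid: 1
    }
    node {
        ring0_addr: node2
        nodeid: 2
    }
    node {
        ring0_addr: quorumnode
        nodeid: 3
    }
}
quorum {
    provider: corosync_votequorum
}

With three votes, losing any one node still leaves a quorate partition, so the survivor doesn't have to guess.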
In this case, it looks like a STONITH problem. Without a fencing mechanism, node2 cannot be 100% certain that node1 is dead. When you manually shut down corosync, I suspect that as part of the process node1 alerts the cluster that it's leaving as part of a planned shutdown. When it just disappears, node2 can't determine what happened; it just knows node1 isn't answering anymore. Node1 is not necessarily down; there could just be a network problem between node1 and node2. In such a scenario, it would be very bad for node2 to map/mount the RBDs and start writing data while node1 is still providing the same service. Quorum can help here too, but for node2 to be certain node1 is down, it needs an out-of-band method to force power off node1, e.g. I use IPMI on a separate physical network.
So... add more nodes as quorum members. You can set weight restrictions on the resource groups to prevent them from running on these nodes if they are not as powerful as the nodes you're using now. Then add a STONITH mechanism on all the nodes, and verify it works.
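As a rough sketch of those two pieces in the crm shell (the IPMI addresses and credentials are placeholders, "quorumnode" is the hypothetical extra member, and external/ipmi is just one of several fencing agents):

# keep the NFS resource group off the quorum-only member
location loc_g_rbd_share_1 g_rbd_share_1 -inf: quorumnode

# one IPMI fencing device per node; never run a node's own fence device on that node
primitive st_node1 stonith:external/ipmi \
    params hostname=node1 ipaddr=10.0.0.101 userid=admin passwd=secret interface=lanplus
primitive st_node2 stonith:external/ipmi \
    params hostname=node2 ipaddr=10.0.0.102 userid=admin passwd=secret interface=lanplus
location loc_st_node1 st_node1 -inf: node1
location loc_st_node2 st_node2 -inf: node2
property stonith-enabled=true

Something like "stonith_admin --reboot node1" run from node2 is a reasonable way to check that fencing actually works before you rely on it.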
Once you do that, you should see things act the way you expect. Good luck!
-Steve
On 12/19/2015 03:46 AM, maoqi1982 wrote:
Hi list
I have a test Ceph cluster of 3 nodes (node0: mon; node1: osd and NFS server 1; node2: osd and NFS server 2).
OS: CentOS 6.6, kernel: 3.10.94-1.el6.elrepo.x86_64, Ceph version 0.94.5.
I followed the http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/ instructions to set up an active/standby NFS environment.
When I run "# service corosync stop" or "# poweroff" on node1, failover works fine (the NFS server is taken over by node2). But when I test cutting off node1's power, the failover fails.
1. [root@node1 ~]# crm status
Last updated: Fri Dec 18 17:14:19 2015
Last change: Fri Dec 18 17:13:29 2015
Stack: classic openais (with plugin)
Current DC: node1 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 3 expected votes
8 Resources configured
Online: [ node1 node2 ]
Resource Group: g_rbd_share_1
p_rbd_map_1 (ocf::ceph:rbd.in): Started node1
p_fs_rbd_1 (ocf::heartbeat:Filesystem): Started node1
p_export_rbd_1 (ocf::heartbeat:exportfs): Started node1
p_vip_1 (ocf::heartbeat:IPaddr): Started node1
Clone Set: clo_nfs [g_nfs]
Started: [ node1 node2 ]
2. [root@node1 ~]# service corosync stop
[root@node2 cluster]# crm status
Last updated: Fri Dec 18 17:14:59 2015
Last change: Fri Dec 18 17:13:29 2015
Stack: classic openais (with plugin)
Current DC: node2 - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured, 3 expected votes
8 Resources configured
Online: [ node2 ]
OFFLINE: [ node1 ]
Resource Group: g_rbd_share_1
p_rbd_map_1 (ocf::ceph:rbd.in): Started node2
p_fs_rbd_1 (ocf::heartbeat:Filesystem): Started node2
p_export_rbd_1 (ocf::heartbeat:exportfs): Started node2
p_vip_1 (ocf::heartbeat:IPaddr): Started node2
Clone Set: clo_nfs [g_nfs]
Started: [ node2 ]
Stopped: [ node1 ]
3. cut off node1 power manually
[root@node2 cluster]# crm status
Last updated: Fri Dec 18 17:23:06 2015
Last change: Fri Dec 18 17:13:29 2015
Stack: classic openais (with plugin)
Current DC: node2 - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured, 3 expected votes
8 Resources configured
Online: [ node2 ]
OFFLINE: [ node1 ]
Clone Set: clo_nfs [g_nfs]
Started: [ node2 ]
Stopped: [ node1 ]
Failed actions:
p_rbd_map_1_start_0 on node2 'unknown error' (1): call=48, status=Timed Out, last-rc-change='Fri Dec 18 17:22:19 2015', queued=0ms, exec=20002ms
corosync.log:
Dec 18 17:22:19 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 668: memb=1, new=0, lost=1
Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: memb: node2 1211279552
Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: lost: node1 1194502336
Dec 18 17:22:19 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 668: memb=1, new=0, lost=0
Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: MEMB: node2 1211279552
Dec 18 17:22:19 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Dec 18 17:22:19 corosync [pcmk ] info: update_member: Node 1194502336/node1 is now: lost
Dec 18 17:22:19 [2695] node2 crmd: notice: plugin_handle_membership: Membership 668: quorum lost
Dec 18 17:22:19 [2690] node2 cib: notice: plugin_handle_membership: Membership 668: quorum lost
Dec 18 17:22:19 corosync [pcmk ] info: send_member_notification: Sending membership update 668 to 3 children
Dec 18 17:22:19 [2691] node2 stonith-ng: notice: plugin_handle_membership: Membership 668: quorum lost
Dec 18 17:22:19 [2695] node2 crmd: info: crm_update_peer_proc: plugin_handle_membership: Node node1[1194502336] - unknown is now lost
Dec 18 17:22:19 [2690] node2 cib: info: crm_update_peer_proc: plugin_handle_membership: Node node1[1194502336] - unknown is now lost
Dec 18 17:22:19 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 18 17:22:19 [2695] node2 crmd: info: peer_update_callback: Client node1/peer now has status [offline] (DC=true, changed=111312)
Dec 18 17:22:19 [2691] node2 stonith-ng: info: crm_update_peer_proc: plugin_handle_membership: Node node1[1194502336] - unknown is now lost
Dec 18 17:22:19 [2695] node2 crmd: info: peer_update_callback: Peer node1 left us
Dec 18 17:22:19 [2690] node2 cib: notice: crm_update_peer_state: plugin_handle_membership: Node node1[1194502336] - state is now lost (was member)
Dec 18 17:22:19 [2695] node2 crmd: warning: match_down_event: No match for shutdown action on node1
Dec 18 17:22:19 [2691] node2 stonith-ng: notice: crm_update_peer_state: plugin_handle_membership: Node node1[1194502336] - state is now lost (was member)
Dec 18 17:22:19 [2695] node2 crmd: notice: peer_update_callback: Stonith/shutdown of node1 not matched
Dec 18 17:22:19 [2695] node2 crmd: info: crm_update_peer_join: peer_update_callback: Node node1[1194502336] - join-2 phase 4 -> 0
Dec 18 17:22:19 [2695] node2 crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:237, 1)
Dec 18 17:22:19 [2695] node2 crmd: notice: crm_update_peer_state: plugin_handle_membership: Node node1[1194502336] - state is now lost (was member)
Dec 18 17:22:19 [2695] node2 crmd: info: peer_update_callback: node1 is now lost (was member)
Dec 18 17:22:19 [2695] node2 crmd: warning: match_down_event: No match for shutdown action on node1
Dec 18 17:22:19 corosync [CPG ] chosen downlist: sender r(0) ip(192.168.50.72) ; members(old:2 left:1)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/88)
Dec 18 17:22:19 [2695] node2 crmd: notice: peer_update_callback: Stonith/shutdown of node1 not matched
Dec 18 17:22:19 [2695] node2 crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:237, 1)
Dec 18 17:22:19 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/89)
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: Diff: --- 0.69.157 2
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: Diff: +++ 0.69.158 (null)
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: + /cib: @num_updates=158
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: + /cib/status/node_state[@id='node1']: @crmd=offline, @crm-debug-origin=peer_update_callback
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node2/crmd/88, version=0.69.158)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node2/crmd/89, version=0.69.158)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section nodes to master (origin=local/crmd/90)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/91)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section cib to master (origin=local/crmd/92)
Dec 18 17:22:19 [2695] node2 crmd: info: crmd_cs_dispatch: Setting expected votes to 3
Dec 18 17:22:19 [2695] node2 crmd: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 18 17:22:19 [2695] node2 crmd: info: do_dc_join_offer_one: An unknown node joined - (re-)offer to any unconfirmed nodes
Dec 18 17:22:19 [2695] node2 crmd: info: join_make_offer: Making join offers based on membership 668
Dec 18 17:22:19 [2695] node2 crmd: info: join_make_offer: Skipping node2: already known 4
Dec 18 17:22:19 [2695] node2 crmd: info: join_make_offer: Not making an offer to node1: not active (lost)
Dec 18 17:22:19 [2695] node2 crmd: info: abort_transition_graph: Transition aborted: Peer Halt (source=do_te_invoke:158, 1)
Dec 18 17:22:19 [2695] node2 crmd: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 18 17:22:19 [2695] node2 crmd: info: crmd_join_phase_log: join-2: node2=confirmed
Dec 18 17:22:19 [2695] node2 crmd: info: crmd_join_phase_log: join-2: node1=none
Dec 18 17:22:19 [2695] node2 crmd: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Dec 18 17:22:19 [2695] node2 crmd: info: do_dc_join_offer_one: An unknown node joined - (re-)offer to any unconfirmed nodes
Dec 18 17:22:19 [2695] node2 crmd: info: join_make_offer: Skipping node2: already known 4
Dec 18 17:22:19 [2695] node2 crmd: info: join_make_offer: Not making an offer to node1: not active (lost)
Dec 18 17:22:19 [2695] node2 crmd: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 18 17:22:19 [2695] node2 crmd: info: crmd_join_phase_log: join-2: node2=confirmed
Dec 18 17:22:19 [2695] node2 crmd: info: crmd_join_phase_log: join-2: node1=none
Dec 18 17:22:19 [2695] node2 crmd: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 18 17:22:19 [2695] node2 crmd: info: abort_transition_graph: Transition aborted: Peer Cancelled (source=do_te_invoke:151, 1)
Dec 18 17:22:19 [2693] node2 attrd: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Dec 18 17:22:19 [2693] node2 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section nodes: OK (rc=0, origin=node2/crmd/90, version=0.69.158)
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: Diff: --- 0.69.158 2
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: Diff: +++ 0.69.159 (null)
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: + /cib: @num_updates=159
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: + /cib/status/node_state[@id='node2']: @crm-debug-origin=post_cache_update
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: + /cib/status/node_state[@id='node1']: @in_ccm=false, @crm-debug-origin=post_cache_update
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node2/crmd/91, version=0.69.159)
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: Diff: --- 0.69.159 2
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: Diff: +++ 0.69.160 (null)
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: + /cib: @num_updates=160, @have-quorum=0
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section cib: OK (rc=0, origin=node2/crmd/92, version=0.69.160)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section crm_config to master (origin=local/crmd/94)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section nodes to master (origin=local/crmd/95)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/96)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section cib to master (origin=local/crmd/97)
Dec 18 17:22:19 [2693] node2 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section crm_config: OK (rc=0, origin=node2/crmd/94, version=0.69.160)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section nodes: OK (rc=0, origin=node2/crmd/95, version=0.69.160)
Dec 18 17:22:19 [2694] node2 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: Diff: --- 0.69.160 2
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: Diff: +++ 0.69.161 (null)
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: + /cib: @num_updates=161
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: + /cib/status/node_state[@id='node2']: @crm-debug-origin=do_state_transition
Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op: + /cib/status/node_state[@id='node1']: @crm-debug-origin=do_state_transition, @join=down
Dec 18 17:22:19 [2694] node2 pengine: info: determine_online_status: Node node2 is online
Dec 18 17:22:19 [2694] node2 pengine: info: group_print: Resource Group: g_rbd_share_1
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node2/crmd/96, version=0.69.161)
Dec 18 17:22:19 [2694] node2 pengine: info: native_print: p_rbd_map_1 (ocf::ceph:rbd.in): Stopped
Dec 18 17:22:19 [2694] node2 pengine: info: native_print: p_fs_rbd_1 (ocf::heartbeat:Filesystem): Stopped
Dec 18 17:22:19 [2694] node2 pengine: info: native_print: p_export_rbd_1 (ocf::heartbeat:exportfs): Stopped
Dec 18 17:22:19 [2694] node2 pengine: info: native_print: p_vip_1 (ocf::heartbeat:IPaddr): Stopped
Dec 18 17:22:19 [2694] node2 pengine: info: clone_print: Clone Set: clo_nfs [g_nfs]
Dec 18 17:22:19 [2694] node2 pengine: info: short_print: Started: [ node2 ]
Dec 18 17:22:19 [2694] node2 pengine: info: short_print: Stopped: [ node1 ]
Dec 18 17:22:19 [2694] node2 pengine: info: rsc_merge_weights: p_rpcbind:1: Rolling back scores from p_nfs_server:1
Dec 18 17:22:19 [2694] node2 pengine: info: native_color: Resource p_rpcbind:1 cannot run anywhere
Dec 18 17:22:19 [2694] node2 pengine: info: native_color: Resource p_nfs_server:1 cannot run anywhere
Dec 18 17:22:19 [2694] node2 pengine: info: RecurringOp: Start recurring monitor (10s) for p_rbd_map_1 on node2
Dec 18 17:22:19 [2694] node2 pengine: info: RecurringOp: Start recurring monitor (20s) for p_fs_rbd_1 on node2
Dec 18 17:22:19 [2694] node2 pengine: info: RecurringOp: Start recurring monitor (10s) for p_export_rbd_1 on node2
Dec 18 17:22:19 [2694] node2 pengine: info: RecurringOp: Start recurring monitor (5s) for p_vip_1 on node2
Dec 18 17:22:19 [2694] node2 pengine: notice: LogActions: Start p_rbd_map_1 (node2)
Dec 18 17:22:19 [2694] node2 pengine: notice: LogActions: Start p_fs_rbd_1 (node2)
Dec 18 17:22:19 [2694] node2 pengine: notice: LogActions: Start p_export_rbd_1 (node2)
Dec 18 17:22:19 [2694] node2 pengine: notice: LogActions: Start p_vip_1 (node2)
Dec 18 17:22:19 [2694] node2 pengine: info: LogActions: Leave p_rpcbind:0 (Started node2)
Dec 18 17:22:19 [2694] node2 pengine: info: LogActions: Leave p_nfs_server:0 (Started node2)
Dec 18 17:22:19 [2694] node2 pengine: info: LogActions: Leave p_rpcbind:1 (Stopped)
Dec 18 17:22:19 [2694] node2 pengine: info: LogActions: Leave p_nfs_server:1 (Stopped)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section cib: OK (rc=0, origin=node2/crmd/97, version=0.69.161)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/attrd/24)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node2/attrd/24, version=0.69.161)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/attrd/26)
Dec 18 17:22:19 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node2/attrd/26, version=0.69.161)
Dec 18 17:22:19 [2694] node2 pengine: notice: process_pe_message: Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-234.bz2
Dec 18 17:22:19 [2695] node2 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Dec 18 17:22:19 [2695] node2 crmd: info: do_te_invoke: Processing graph 3 (ref=pe_calc-dc-1450430539-52) derived from /var/lib/pacemaker/pengine/pe-input-234.bz2
Dec 18 17:22:19 [2695] node2 crmd: notice: te_rsc_command: Initiating action 6: start p_rbd_map_1_start_0 on node2 (local)
Dec 18 17:22:19 [2695] node2 crmd: info: do_lrm_rsc_op: Performing key=6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9 op=p_rbd_map_1_start_0
Dec 18 17:22:19 [2692] node2 lrmd: info: log_execute: executing - rsc:p_rbd_map_1 action:start call_id:48
Dec 18 17:22:24 [2690] node2 cib: info: cib_process_ping: Reporting our current digest to node2: 29875b5ae8dcbbb1d07f904ca3423de2 for 0.69.161 (0x12607b0 0)
Dec 18 17:22:39 [2692] node2 lrmd: warning: child_timeout_callback: p_rbd_map_1_start_0 process (PID 11010) timed out
Dec 18 17:22:39 [2692] node2 lrmd: warning: operation_finished: p_rbd_map_1_start_0:11010 - timed out after 20000ms
Dec 18 17:22:39 [2692] node2 lrmd: notice: operation_finished: p_rbd_map_1_start_0:11010:stderr [ libust[11019/11019]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305) ]
Dec 18 17:22:39 [2692] node2 lrmd: info: log_finished: finished - rsc:p_rbd_map_1 action:start call_id:48 pid:11010 exit-code:1 exec-time:20002ms queue-time:0ms
Dec 18 17:22:39 [2695] node2 crmd: info: services_os_action_execute: Managed rbd.in_meta-data_0 process 11117 exited with rc=0
Dec 18 17:22:39 [2695] node2 crmd: error: process_lrm_event: Operation p_rbd_map_1_start_0: Timed Out (node=node2, call=48, timeout=20000ms)
Dec 18 17:22:39 [2692] node2 lrmd: warning: child_timeout_callback: p_rbd_map_1_start_0 process (PID 11010) timed out
Dec 18 17:22:39 [2692] node2 lrmd: warning: operation_finished: p_rbd_map_1_start_0:11010 - timed out after 20000ms
Dec 18 17:22:39 [2692] node2 lrmd: notice: operation_finished: p_rbd_map_1_start_0:11010:stderr [ libust[11019/11019]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305) ]
Dec 18 17:22:39 [2692] node2 lrmd: info: log_finished: finished - rsc:p_rbd_map_1 action:start call_id:48 pid:11010 exit-code:1 exec-time:20002ms queue-time:0ms
Dec 18 17:22:39 [2695] node2 crmd: info: services_os_action_execute: Managed rbd.in_meta-data_0 process 11117 exited with rc=0
Dec 18 17:22:39 [2695] node2 crmd: error: process_lrm_event: Operation p_rbd_map_1_start_0: Timed Out (node=node2, call=48, timeout=20000ms)
Dec 18 17:22:39 [2695] node2 crmd: notice: process_lrm_event: node2-p_rbd_map_1_start_0:48 [ libust[11019/11019]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)\n ]
Dec 18 17:22:39 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/99)
Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op: Diff: --- 0.69.161 2
Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op: Diff: +++ 0.69.162 (null)
Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op: + /cib: @num_updates=162
Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op: + /cib/status/node_state[@id='node2']: @crm-debug-origin=do_update_resource
Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op: + /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']/lrm_rsc_op[@id='p_rbd_map_1_last_0']: @operation_key=p_rbd_map_1_start_0, @operation=start, @transition-key=6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, @transition-magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, @call-id=48, @rc-code=1, @op-status=2, @last-run=1450430539, @last-rc-change=1450430539, @exec-time=20002
Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']: <lrm_rsc_op id="p_rbd_map_1_last_failure_0" operation_key="p_rbd_map_1_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" transition-key="6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9" transition-magic="2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9" call-id="48" rc-code="1" op-status="2" interval="0" l
Dec 18 17:22:39 [2690] node2 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node2/crmd/99, version=0.69.162)
Dec 18 17:22:39 [2695] node2 crmd: warning: status_from_rc: Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs. rc: 1): Error
Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount: Updating failcount for p_rbd_map_1 on node2 after failed start: rc=1 (update=INFINITY, time=1450430559)
Dec 18 17:22:39 [2695] node2 crmd: notice: abort_transition_graph: Transition aborted by p_rbd_map_1_start_0 'modify' on node2: Event failed (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162, source=match_graph_event:344, 0)
Dec 18 17:22:39 [2695] node2 crmd: info: match_graph_event: Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4)
Dec 18 17:22:39 [2693] node2 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_rbd_map_1 (INFINITY)
Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount: Updating failcount for p_rbd_map_1 on node2 after failed start: rc=1 (update=INFINITY, time=1450430559)
Dec 18 17:22:39 [2695] node2 crmd: info: process_graph_event: Detected action (3.6) p_rbd_map_1_start_0.48=unknown error: failed
Dec 18 17:22:39 [2695] node2 crmd: warning: status_from_rc: Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs. rc: 1): Error
Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount: Updating failcount for p_rbd_map_1 on node2 after failed start: rc=1 (update=INFINITY, time=1450430559)
Dec 18 17:22:39 [2695] node2 crmd: info: abort_transition_graph: Transition aborted by p_rbd_map_1_start_0 'create' on (null): Event failed (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162, source=match_graph_event:344, 0)
Dec 18 17:22:39 [2695] node2 crmd: info: match_graph_event: Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4)
Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount: Updating failcount for p_rbd_map_1 on node2 after failed start: rc=1 (update=INFINITY, time=1450430559)
Dec 18 17:22:39 [2695] node2 crmd: info: process_graph_event: Detected action (3.6) p_rbd_map_1_start_0.48=unknown error: failed
Dec 18 17:22:39 [2693] node2 attrd: notice: attrd_perform_update: Sent update 28: fail-count-p_rbd_map_1=INFINITY
Dec 18 17:22:39 [2690] node2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/attrd/28)
Dec 18 17:22:39 [2695] node2 crmd: notice: run_graph: Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=8, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-234.bz2): Stopped
Dec 18 17:22:39 [2695] node2 crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec 18 17:22:39 [2693] node2 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-p_rbd_map_1 (1450430559)
Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op: Diff: --- 0.69.162 2
Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op: Diff: +++ 0.69.163 (null)
Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op: + /cib: @num_updates=163
Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='node2']/transient_attributes[@id='node2']/instance_attributes[@id='status-node2']: <nvpair id="status-node2-fail-count-p_rbd_map_1" name="fail-count-p_rbd_map_1" value="INFINITY"/>
.........
thanks
--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma310@xxxxxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com