Re: nfs over rbd problem

I've run into many problems trying to run an RBD/NFS Pacemaker setup like you describe on a two-node cluster. In my case, most of the problems were a result of a) no quorum and b) no STONITH. If you're going to be running this setup in production, I *highly* recommend adding more nodes (if only to maintain quorum). Without quorum, you will run into split-brain situations where you'll see anything from both nodes starting the same services to configuration changes disappearing when the nodes can't agree on a recent revision.
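
For example (just a sketch; "node3" is a placeholder for whatever extra voting member you add), once a third corosync/Pacemaker member has joined you can tell Pacemaker to stop resources whenever quorum is lost instead of ignoring it (your pengine log shows "On loss of CCM Quorum: Ignore", which is the usual two-node workaround):

  crm configure property no-quorum-policy=stop

After that, "crm status" should report "partition with quorum" on the survivors as long as two of the three members can still see each other.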

In this case, it looks like a STONITH problem. Without a fencing mechanism, node2 cannot be 100% certain that node1 is dead. When you manually shut down corosync, I suspect that as part of the process, node1 alerts the cluster that it's leaving as part of a planned shutdown. When it just disappears, node2 can't determine what happened; it just knows node1 isn't answering anymore. Node1 is not necessarily down: there could just be a network problem between node1 and node2. In such a scenario, it would be very bad for node2 to map/mount the RBDs and start writing data while node1 is still providing the same service. Quorum can help here too, but for node2 to be certain node1 is down, it needs an out-of-band method to force node1 to power off; e.g., I use IPMI on a separate physical network.
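
As a rough sketch only (the agent name, addresses, and credentials below are placeholders; use whichever fence agent matches your hardware, e.g. fence_ipmilan from fence-agents or external/ipmi from cluster-glue), the Pacemaker side looks something like:

  crm configure primitive st-node1 stonith:fence_ipmilan \
      params ipaddr="10.0.0.1" login="admin" passwd="secret" lanplus="1" \
             pcmk_host_list="node1" op monitor interval="60s"
  crm configure primitive st-node2 stonith:fence_ipmilan \
      params ipaddr="10.0.0.2" login="admin" passwd="secret" lanplus="1" \
             pcmk_host_list="node2" op monitor interval="60s"
  crm configure location l-st-node1 st-node1 -inf: node1
  crm configure location l-st-node2 st-node2 -inf: node2
  crm configure property stonith-enabled=true

The two location constraints just keep each fence device from running on the node it is supposed to kill.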

So... add more nodes as quorum members. You can set weight restrictions (location constraints) on the resource groups to prevent them from running on these nodes if they are not as powerful as the nodes you're using now. Then add a STONITH mechanism on all the nodes, and verify it works.
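
Something along these lines (again a sketch, with "node3" standing in for the quorum-only member) keeps your existing groups off the extra node, and a manual fence request is a quick way to verify STONITH actually works before you rely on it:

  crm configure location loc-rbd-avoid-node3 g_rbd_share_1 -inf: node3
  crm configure location loc-nfs-avoid-node3 clo_nfs -inf: node3
  stonith_admin --reboot node1

If that stonith_admin call doesn't power-cycle node1, recovery after a real power loss won't work either.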

Once you do that, you should see things act the way you expect. Good luck!

-Steve

On 12/19/2015 03:46 AM, maoqi1982 wrote:
Hi list
I have a test ceph cluster that includes 3 nodes (node0: mon; node1: osd and nfs server1; node2: osd and nfs server2).
OS: CentOS 6.6, kernel: 3.10.94-1.el6.elrepo.x86_64, ceph version 0.94.5
I followed the instructions at http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/ to set up an active/standby NFS environment.
When using "service corosync stop" or "poweroff" on node1, failover goes fine (the NFS server is taken over by node2). But when I test cutting off the power to node1, the failover fails.
1. [root@node1 ~]# crm status
Last updated: Fri Dec 18 17:14:19 2015
Last change: Fri Dec 18 17:13:29 2015
Stack: classic openais (with plugin)
Current DC: node1 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 3 expected votes
8 Resources configured
Online: [ node1 node2 ]
 Resource Group: g_rbd_share_1
     p_rbd_map_1        (ocf::ceph:rbd.in):     Started node1
     p_fs_rbd_1 (ocf::heartbeat:Filesystem):    Started node1
     p_export_rbd_1     (ocf::heartbeat:exportfs):      Started node1
     p_vip_1    (ocf::heartbeat:IPaddr):        Started node1
 Clone Set: clo_nfs [g_nfs]
     Started: [ node1 node2 ]
2. [root@node1 ~]# service corosync stop
[root@node2 cluster]# crm status
Last updated: Fri Dec 18 17:14:59 2015
Last change: Fri Dec 18 17:13:29 2015
Stack: classic openais (with plugin)
Current DC: node2 - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured, 3 expected votes
8 Resources configured
Online: [ node2 ]
OFFLINE: [ node1 ]
 Resource Group: g_rbd_share_1
     p_rbd_map_1        (ocf::ceph:rbd.in):     Started node2
     p_fs_rbd_1 (ocf::heartbeat:Filesystem):    Started node2
     p_export_rbd_1     (ocf::heartbeat:exportfs):      Started node2
     p_vip_1    (ocf::heartbeat:IPaddr):        Started node2
 Clone Set: clo_nfs [g_nfs]
     Started: [ node2 ]
     Stopped: [ node1 ]

3. cut off node1 power manually
[root@node2 cluster]# crm status
Last updated: Fri Dec 18 17:23:06 2015
Last change: Fri Dec 18 17:13:29 2015
Stack: classic openais (with plugin)
Current DC: node2 - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured, 3 expected votes
8 Resources configured
Online: [ node2 ]
OFFLINE: [ node1 ]
 Clone Set: clo_nfs [g_nfs]
     Started: [ node2 ]
     Stopped: [ node1 ]
Failed actions:
    p_rbd_map_1_start_0 on node2 'unknown error' (1): call=48, status=Timed Out, last-rc-change='Fri Dec 18 17:22:19 2015', queued=0ms, exec=20002ms
corosync.log:
Dec 18 17:22:19 corosync [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 668: memb=1, new=0, lost=1
Dec 18 17:22:19 corosync [pcmk  ] info: pcmk_peer_update: memb: node2 1211279552
Dec 18 17:22:19 corosync [pcmk  ] info: pcmk_peer_update: lost: node1 1194502336
Dec 18 17:22:19 corosync [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 668: memb=1, new=0, lost=0
Dec 18 17:22:19 corosync [pcmk  ] info: pcmk_peer_update: MEMB: node2 1211279552
Dec 18 17:22:19 corosync [pcmk  ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Dec 18 17:22:19 corosync [pcmk  ] info: update_member: Node 1194502336/node1 is now: lost
Dec 18 17:22:19 [2695] node2       crmd:   notice: plugin_handle_membership:     Membership 668: quorum lost
Dec 18 17:22:19 [2690] node2        cib:   notice: plugin_handle_membership:     Membership 668: quorum lost
Dec 18 17:22:19 corosync [pcmk  ] info: send_member_notification: Sending membership update 668 to 3 children
Dec 18 17:22:19 [2691] node2 stonith-ng:   notice: plugin_handle_membership:     Membership 668: quorum lost
Dec 18 17:22:19 [2695] node2       crmd:     info: crm_update_peer_proc:     plugin_handle_membership: Node node1[1194502336] - unknown is now lost
Dec 18 17:22:19 [2690] node2        cib:     info: crm_update_peer_proc:     plugin_handle_membership: Node node1[1194502336] - unknown is now lost
Dec 18 17:22:19 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 18 17:22:19 [2695] node2       crmd:     info: peer_update_callback:     Client node1/peer now has status [offline] (DC=true, changed=111312)
Dec 18 17:22:19 [2691] node2 stonith-ng:     info: crm_update_peer_proc:     plugin_handle_membership: Node node1[1194502336] - unknown is now lost
Dec 18 17:22:19 [2695] node2       crmd:     info: peer_update_callback:     Peer node1 left us
Dec 18 17:22:19 [2690] node2        cib:   notice: crm_update_peer_state:     plugin_handle_membership: Node node1[1194502336] - state is now lost (was member)
Dec 18 17:22:19 [2695] node2       crmd:  warning: match_down_event:     No match for shutdown action on node1
Dec 18 17:22:19 [2691] node2 stonith-ng:   notice: crm_update_peer_state:     plugin_handle_membership: Node node1[1194502336] - state is now lost (was member)
Dec 18 17:22:19 [2695] node2       crmd:   notice: peer_update_callback:     Stonith/shutdown of node1 not matched
Dec 18 17:22:19 [2695] node2       crmd:     info: crm_update_peer_join:     peer_update_callback: Node node1[1194502336] - join-2 phase 4 -> 0
Dec 18 17:22:19 [2695] node2       crmd:     info: abort_transition_graph:     Transition aborted: Node failure (source=peer_update_callback:237, 1)
Dec 18 17:22:19 [2695] node2       crmd:   notice: crm_update_peer_state:     plugin_handle_membership: Node node1[1194502336] - state is now lost (was member)
Dec 18 17:22:19 [2695] node2       crmd:     info: peer_update_callback:     node1 is now lost (was member)
Dec 18 17:22:19 [2695] node2       crmd:  warning: match_down_event:     No match for shutdown action on node1
Dec 18 17:22:19 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.50.72) ; members(old:2 left:1)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section status to master (origin=local/crmd/88)
Dec 18 17:22:19 [2695] node2       crmd:   notice: peer_update_callback:     Stonith/shutdown of node1 not matched
Dec 18 17:22:19 [2695] node2       crmd:     info: abort_transition_graph:     Transition aborted: Node failure (source=peer_update_callback:237, 1)
Dec 18 17:22:19 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section status to master (origin=local/crmd/89)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     Diff: --- 0.69.157 2
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     Diff: +++ 0.69.158 (null)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     +  /cib:  @num_updates=158
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     +  /cib/status/node_state[@id='node1']:  @crmd=offline, @crm-debug-origin=peer_update_callback
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section status: OK (rc=0, origin=node2/crmd/88, version=0.69.158)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section status: OK (rc=0, origin=node2/crmd/89, version=0.69.158)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section nodes to master (origin=local/crmd/90)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section status to master (origin=local/crmd/91)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section cib to master (origin=local/crmd/92)
Dec 18 17:22:19 [2695] node2       crmd:     info: crmd_cs_dispatch:     Setting expected votes to 3
Dec 18 17:22:19 [2695] node2       crmd:   notice: do_state_transition:     State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 18 17:22:19 [2695] node2       crmd:     info: do_dc_join_offer_one:     An unknown node joined - (re-)offer to any unconfirmed nodes
Dec 18 17:22:19 [2695] node2       crmd:     info: join_make_offer:     Making join offers based on membership 668
Dec 18 17:22:19 [2695] node2       crmd:     info: join_make_offer:     Skipping node2: already known 4
Dec 18 17:22:19 [2695] node2       crmd:     info: join_make_offer:     Not making an offer to node1: not active (lost)
Dec 18 17:22:19 [2695] node2       crmd:     info: abort_transition_graph:     Transition aborted: Peer Halt (source=do_te_invoke:158, 1)
Dec 18 17:22:19 [2695] node2       crmd:     info: do_state_transition:     State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 18 17:22:19 [2695] node2       crmd:     info: crmd_join_phase_log:     join-2: node2=confirmed
Dec 18 17:22:19 [2695] node2       crmd:     info: crmd_join_phase_log:     join-2: node1=none
Dec 18 17:22:19 [2695] node2       crmd:     info: do_state_transition:     State transition S_FINALIZE_JOIN -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Dec 18 17:22:19 [2695] node2       crmd:     info: do_dc_join_offer_one:     An unknown node joined - (re-)offer to any unconfirmed nodes
Dec 18 17:22:19 [2695] node2       crmd:     info: join_make_offer:     Skipping node2: already known 4
Dec 18 17:22:19 [2695] node2       crmd:     info: join_make_offer:     Not making an offer to node1: not active (lost)
Dec 18 17:22:19 [2695] node2       crmd:     info: do_state_transition:     State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 18 17:22:19 [2695] node2       crmd:     info: crmd_join_phase_log:     join-2: node2=confirmed
Dec 18 17:22:19 [2695] node2       crmd:     info: crmd_join_phase_log:     join-2: node1=none
Dec 18 17:22:19 [2695] node2       crmd:     info: do_state_transition:     State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Dec 18 17:22:19 [2695] node2       crmd:     info: abort_transition_graph:     Transition aborted: Peer Cancelled (source=do_te_invoke:151, 1)
Dec 18 17:22:19 [2693] node2      attrd:   notice: attrd_local_callback:     Sending full refresh (origin=crmd)
Dec 18 17:22:19 [2693] node2      attrd:   notice: attrd_trigger_update:     Sending flush op to all hosts for: shutdown (0)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section nodes: OK (rc=0, origin=node2/crmd/90, version=0.69.158)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     Diff: --- 0.69.158 2
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     Diff: +++ 0.69.159 (null)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     +  /cib:  @num_updates=159
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     +  /cib/status/node_state[@id='node2']:  @crm-debug-origin=post_cache_update
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     +  /cib/status/node_state[@id='node1']:  @in_ccm=false, @crm-debug-origin=post_cache_update
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section status: OK (rc=0, origin=node2/crmd/91, version=0.69.159)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     Diff: --- 0.69.159 2
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     Diff: +++ 0.69.160 (null)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     +  /cib:  @num_updates=160, @have-quorum=0
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section cib: OK (rc=0, origin=node2/crmd/92, version=0.69.160)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section crm_config to master (origin=local/crmd/94)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section nodes to master (origin=local/crmd/95)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section status to master (origin=local/crmd/96)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section cib to master (origin=local/crmd/97)
Dec 18 17:22:19 [2693] node2      attrd:   notice: attrd_trigger_update:     Sending flush op to all hosts for: probe_complete (true)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section crm_config: OK (rc=0, origin=node2/crmd/94, version=0.69.160)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section nodes: OK (rc=0, origin=node2/crmd/95, version=0.69.160)
Dec 18 17:22:19 [2694] node2    pengine:   notice: unpack_config:     On loss of CCM Quorum: Ignore
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     Diff: --- 0.69.160 2
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     Diff: +++ 0.69.161 (null)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     +  /cib:  @num_updates=161
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     +  /cib/status/node_state[@id='node2']:  @crm-debug-origin=do_state_transition
Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:     +  /cib/status/node_state[@id='node1']:  @crm-debug-origin=do_state_transition, @join=down
Dec 18 17:22:19 [2694] node2    pengine:     info: determine_online_status:     Node node2 is online
Dec 18 17:22:19 [2694] node2    pengine:     info: group_print:      Resource Group: g_rbd_share_1
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section status: OK (rc=0, origin=node2/crmd/96, version=0.69.161)
Dec 18 17:22:19 [2694] node2    pengine:     info: native_print:          p_rbd_map_1    (ocf::ceph:rbd.in):    Stopped
Dec 18 17:22:19 [2694] node2    pengine:     info: native_print:          p_fs_rbd_1    (ocf::heartbeat:Filesystem):    Stopped
Dec 18 17:22:19 [2694] node2    pengine:     info: native_print:          p_export_rbd_1    (ocf::heartbeat:exportfs):    Stopped
Dec 18 17:22:19 [2694] node2    pengine:     info: native_print:          p_vip_1    (ocf::heartbeat:IPaddr):    Stopped
Dec 18 17:22:19 [2694] node2    pengine:     info: clone_print:      Clone Set: clo_nfs [g_nfs]
Dec 18 17:22:19 [2694] node2    pengine:     info: short_print:          Started: [ node2 ]
Dec 18 17:22:19 [2694] node2    pengine:     info: short_print:          Stopped: [ node1 ]
Dec 18 17:22:19 [2694] node2    pengine:     info: rsc_merge_weights:     p_rpcbind:1: Rolling back scores from p_nfs_server:1
Dec 18 17:22:19 [2694] node2    pengine:     info: native_color:     Resource p_rpcbind:1 cannot run anywhere
Dec 18 17:22:19 [2694] node2    pengine:     info: native_color:     Resource p_nfs_server:1 cannot run anywhere
Dec 18 17:22:19 [2694] node2    pengine:     info: RecurringOp:      Start recurring monitor (10s) for p_rbd_map_1 on node2
Dec 18 17:22:19 [2694] node2    pengine:     info: RecurringOp:      Start recurring monitor (20s) for p_fs_rbd_1 on node2
Dec 18 17:22:19 [2694] node2    pengine:     info: RecurringOp:      Start recurring monitor (10s) for p_export_rbd_1 on node2
Dec 18 17:22:19 [2694] node2    pengine:     info: RecurringOp:      Start recurring monitor (5s) for p_vip_1 on node2
Dec 18 17:22:19 [2694] node2    pengine:   notice: LogActions:     Start   p_rbd_map_1    (node2)
Dec 18 17:22:19 [2694] node2    pengine:   notice: LogActions:     Start   p_fs_rbd_1    (node2)
Dec 18 17:22:19 [2694] node2    pengine:   notice: LogActions:     Start   p_export_rbd_1    (node2)
Dec 18 17:22:19 [2694] node2    pengine:   notice: LogActions:     Start   p_vip_1    (node2)
Dec 18 17:22:19 [2694] node2    pengine:     info: LogActions:     Leave   p_rpcbind:0    (Started node2)
Dec 18 17:22:19 [2694] node2    pengine:     info: LogActions:     Leave   p_nfs_server:0    (Started node2)
Dec 18 17:22:19 [2694] node2    pengine:     info: LogActions:     Leave   p_rpcbind:1    (Stopped)
Dec 18 17:22:19 [2694] node2    pengine:     info: LogActions:     Leave   p_nfs_server:1    (Stopped)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section cib: OK (rc=0, origin=node2/crmd/97, version=0.69.161)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section status to master (origin=local/attrd/24)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section status: OK (rc=0, origin=node2/attrd/24, version=0.69.161)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section status to master (origin=local/attrd/26)
Dec 18 17:22:19 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section status: OK (rc=0, origin=node2/attrd/26, version=0.69.161)
Dec 18 17:22:19 [2694] node2    pengine:   notice: process_pe_message:     Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-234.bz2
Dec 18 17:22:19 [2695] node2       crmd:     info: do_state_transition:     State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Dec 18 17:22:19 [2695] node2       crmd:     info: do_te_invoke:     Processing graph 3 (ref=pe_calc-dc-1450430539-52) derived from /var/lib/pacemaker/pengine/pe-input-234.bz2
Dec 18 17:22:19 [2695] node2       crmd:   notice: te_rsc_command:     Initiating action 6: start p_rbd_map_1_start_0 on node2 (local)
Dec 18 17:22:19 [2695] node2       crmd:     info: do_lrm_rsc_op:     Performing key=6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9 op=p_rbd_map_1_start_0
Dec 18 17:22:19 [2692] node2       lrmd:     info: log_execute:     executing - rsc:p_rbd_map_1 action:start call_id:48
Dec 18 17:22:24 [2690] node2        cib:     info: cib_process_ping:     Reporting our current digest to node2: 29875b5ae8dcbbb1d07f904ca3423de2 for 0.69.161 (0x12607b0 0)
Dec 18 17:22:39 [2692] node2       lrmd:  warning: child_timeout_callback:     p_rbd_map_1_start_0 process (PID 11010) timed out
Dec 18 17:22:39 [2692] node2       lrmd:  warning: operation_finished:     p_rbd_map_1_start_0:11010 - timed out after 20000ms
Dec 18 17:22:39 [2692] node2       lrmd:   notice: operation_finished:     p_rbd_map_1_start_0:11010:stderr [ libust[11019/11019]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305) ]
Dec 18 17:22:39 [2692] node2       lrmd:     info: log_finished:     finished - rsc:p_rbd_map_1 action:start call_id:48 pid:11010 exit-code:1 exec-time:20002ms queue-time:0ms
Dec 18 17:22:39 [2695] node2       crmd:     info: services_os_action_execute:     Managed rbd.in_meta-data_0 process 11117 exited with rc=0
Dec 18 17:22:39 [2695] node2       crmd:    error: process_lrm_event:     Operation p_rbd_map_1_start_0: Timed Out (node=node2, call=48, timeout=20000ms)
Dec 18 17:22:39 [2695] node2       crmd:   notice: process_lrm_event:     node2-p_rbd_map_1_start_0:48 [ libust[11019/11019]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)\n ]
Dec 18 17:22:39 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section status to master (origin=local/crmd/99)
Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:     Diff: --- 0.69.161 2
Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:     Diff: +++ 0.69.162 (null)
Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:     +  /cib:  @num_updates=162
Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:     +  /cib/status/node_state[@id='node2']:  @crm-debug-origin=do_update_resource
Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:     +  /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']/lrm_rsc_op[@id='p_rbd_map_1_last_0']:  @operation_key=p_rbd_map_1_start_0, @operation=start, @transition-key=6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, @transition-magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, @call-id=48, @rc-code=1, @op-status=2, @last-run=1450430539, @last-rc-change=1450430539, @exec-time=20002
Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:     ++ /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']:  <lrm_rsc_op id="p_rbd_map_1_last_failure_0" operation_key="p_rbd_map_1_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" transition-key="6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9" transition-magic="2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9" call-id="48" rc-code="1" op-status="2" interval="0" l
Dec 18 17:22:39 [2690] node2        cib:     info: cib_process_request:     Completed cib_modify operation for section status: OK (rc=0, origin=node2/crmd/99, version=0.69.162)
Dec 18 17:22:39 [2695] node2       crmd:  warning: status_from_rc:     Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs. rc: 1): Error
Dec 18 17:22:39 [2695] node2       crmd:  warning: update_failcount:     Updating failcount for p_rbd_map_1 on node2 after failed start: rc=1 (update=INFINITY, time=1450430559)
Dec 18 17:22:39 [2695] node2       crmd:   notice: abort_transition_graph:     Transition aborted by p_rbd_map_1_start_0 'modify' on node2: Event failed (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162, source=match_graph_event:344, 0)
Dec 18 17:22:39 [2695] node2       crmd:     info: match_graph_event:     Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4)
Dec 18 17:22:39 [2693] node2      attrd:   notice: attrd_trigger_update:     Sending flush op to all hosts for: fail-count-p_rbd_map_1 (INFINITY)
Dec 18 17:22:39 [2695] node2       crmd:  warning: update_failcount:     Updating failcount for p_rbd_map_1 on node2 after failed start: rc=1 (update=INFINITY, time=1450430559)
Dec 18 17:22:39 [2695] node2       crmd:     info: process_graph_event:     Detected action (3.6) p_rbd_map_1_start_0.48=unknown error: failed
Dec 18 17:22:39 [2695] node2       crmd:  warning: status_from_rc:     Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs. rc: 1): Error
Dec 18 17:22:39 [2695] node2       crmd:  warning: update_failcount:     Updating failcount for p_rbd_map_1 on node2 after failed start: rc=1 (update=INFINITY, time=1450430559)
Dec 18 17:22:39 [2695] node2       crmd:     info: abort_transition_graph:     Transition aborted by p_rbd_map_1_start_0 'create' on (null): Event failed (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162, source=match_graph_event:344, 0)
Dec 18 17:22:39 [2695] node2       crmd:     info: match_graph_event:     Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4)
Dec 18 17:22:39 [2695] node2       crmd:  warning: update_failcount:     Updating failcount for p_rbd_map_1 on node2 after failed start: rc=1 (update=INFINITY, time=1450430559)
Dec 18 17:22:39 [2695] node2       crmd:     info: process_graph_event:     Detected action (3.6) p_rbd_map_1_start_0.48=unknown error: failed
Dec 18 17:22:39 [2693] node2      attrd:   notice: attrd_perform_update:     Sent update 28: fail-count-p_rbd_map_1=INFINITY
Dec 18 17:22:39 [2690] node2        cib:     info: cib_process_request:     Forwarding cib_modify operation for section status to master (origin=local/attrd/28)
Dec 18 17:22:39 [2695] node2       crmd:   notice: run_graph:     Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=8, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-234.bz2): Stopped
Dec 18 17:22:39 [2695] node2       crmd:     info: do_state_transition:     State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec 18 17:22:39 [2693] node2      attrd:   notice: attrd_trigger_update:     Sending flush op to all hosts for: last-failure-p_rbd_map_1 (1450430559)
Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:     Diff: --- 0.69.162 2
Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:     Diff: +++ 0.69.163 (null)
Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:     +  /cib:  @num_updates=163
Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:     ++ /cib/status/node_state[@id='node2']/transient_attributes[@id='node2']/instance_attributes[@id='status-node2']:  <nvpair id="status-node2-fail-count-p_rbd_map_1" name="fail-count-p_rbd_map_1" value="INFINITY"/>
.........

thanks


-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma310@xxxxxxxxxx


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
