Re: shutdown of corosync-notifyd results in shutdown of pacemaker

Andrew Beekhof <andrew@xxxxxxxxxxx> · Fri, 12 Oct 2012 10:58:07 +1100

More specifically, stopping corosync-notifyd results in all
Pacemaker's connections to Corosync being terminated.
Andreas: Did you test this on Linux, or only on Solaris?
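
For anyone following the quoted log further down, here is a minimal sketch (not taken from Pacemaker or Corosync) of how one might check from such a log that each daemon lost its Corosync connection by itself, rather than only being stopped by pacemakerd. It is written in Python. The names CS_ERRORS, LINE_RE, first_errors and decode are made up for illustration. It assumes the numeric codes in the messages ("failed: 2", "rc=9") are Corosync cs_error_t values, which is consistent with the "Bad handle" text logged next to 9. The sample lines are copied from the log in this thread.

```python
import re

# Assumption: the codes in the log are cs_error_t values from
# corosync/corotypes.h. Only the values relevant here are listed.
CS_ERRORS = {1: "CS_OK", 2: "CS_ERR_LIBRARY", 9: "CS_ERR_BAD_HANDLE"}

# Matches "[pid] daemon:  level: function:  message" as seen in the log.
LINE_RE = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<daemon>[\w-]+):\s+(?P<level>\w+):\s+"
    r"(?P<func>\w+):\s+(?P<msg>.*)$"
)

SAMPLE = """\
Oct 02 18:42:19 [27126] pacemakerd:    error: cfg_connection_destroy:   Connection destroyed
Oct 02 18:42:19 [27126] pacemakerd:   notice: pcmk_shutdown_worker:     Shuting down Pacemaker
Oct 02 18:42:19 [27128] stonith-ng:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: 2
Oct 02 18:42:19 [27130]      attrd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: 2
Oct 02 18:42:19 [27127]        cib:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: 2
Oct 02 18:42:19 [27126] pacemakerd:    error: send_cpg_message:         Sending message via cpg FAILED: (rc=9) Bad handle
""".splitlines()


def first_errors(lines):
    """Map each daemon to the (function, message) of its first error/crit line."""
    seen = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("level") in ("error", "crit"):
            seen.setdefault(
                m.group("daemon"), (m.group("func"), m.group("msg").strip())
            )
    return seen


def decode(msg):
    """Return the cs_error_t name for a code found in msg, or None if there is none."""
    m = re.search(r"failed: (\d+)$|\(rc=(\d+)\)", msg)
    if not m:
        return None
    code = int(m.group(1) or m.group(2))
    return CS_ERRORS.get(code, "unknown")


if __name__ == "__main__":
    for daemon, (func, msg) in first_errors(SAMPLE).items():
        print(daemon, func, msg, decode(msg))
```

If every daemon's first error comes from a connection-destroy or CPG dispatch function, the daemons were told independently that their connections were gone.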

On Thu, Oct 11, 2012 at 11:45 PM, Grüninger, Andreas (LGL Extern)
<Andreas.Grueninger@xxxxxxxxxx> wrote:
> When I start
> corosync-notifyd -f -l -s -m <MONITORINGSERVER>
> and close it with CTRL-C, pacemaker shuts down.
> Please see below for the details.
>
> I compiled the current master of corosync (tag 2.1.0) and the current master of pacemaker.
> The OS is Solaris 11U7.
>
> Is this a feature or a bug?
> On Solaris, libqb must be patched to avoid errors.
> Please see
> https://lists.fedorahosted.org/pipermail/quarterback-devel/2012-September/000921.html "[PATCH] -ENOTCONN handled as error when client disconnects"
> Maybe this patch should not deliver -ESHUTDOWN when a client disconnects.
> IMHO this is the adequate result.
>
> Andreas
>
>
> On Thu, Oct 4, 2012 at 5:57 PM, Grüninger, Andreas (LGL Extern) <Andreas.Grueninger@xxxxxxxxxx> wrote:
>>>> Is this an error or the desired result?
>>
>>>Based on the logs, pacemaker thinks corosync died.  Did that happen?
>>>If so there is not much pacemaker can do :-(
>>
>> And that is absolutely OK when corosync dies.
>> But corosync does not die; it is still healthy.
>> What ends is corosync-notifyd, which is started in addition to corosync as a separate process and is stopped with kill when running as a daemon, or with Ctrl-C when running in the foreground.
>> The job of corosync-notifyd is to send SNMP traps.
>> This corresponds to what crm_mon -C .. -S ... does for pacemaker.
>>
>> So either corosync-notifyd sends the wrong signal or pacemaker does a little too much.
>> Pacemaker should simply ignore this connection being closed.
>
> All the Pacemaker daemons are being told, by Corosync itself, that their connections to Corosync are dead.
> It's a little difficult to ignore that.
>
>> Can this be changed in pacemaker, or would it be better solved in corosync/corosync-notifyd?
>
> It needs to be addressed in corosync/corosync-notifyd.
> Corosync's CPG library is the one invoking our
> cpg_connection_destroy() callback.
>
>>
>> Andreas
>>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew@xxxxxxxxxxx]
>> Sent: Wednesday, 3 October 2012 01:09
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Exiting corosync-notifyd results in shutting
>> down of pacemakerd
>>
>> On Wed, Oct 3, 2012 at 2:51 AM, Grüninger, Andreas (LGL Extern) <Andreas.Grueninger@xxxxxxxxxx> wrote:
>>> I am currently investigating the monitoring of corosync/pacemaker with SNMP.
>>> crm_mon used with the OCF resource ClusterMon works as it should.
>>>
>>> But corosync-notifyd can't be used in our case.
>>> I start corosync-notifyd in the foreground as follows:
>>> corosync-notifyd -f -l -s  -m 10.50.235.1
>>>
>>> When I stop the running corosync-notifyd with CTRL-C, pacemaker shuts down with the following entries in the logfile.
>>> Is this an error or the desired result?
>>
>> Based on the logs, pacemaker thinks corosync died.  Did that happen?
>> If so there is not much pacemaker can do :-(
>>
>>>
>>> ....
>>> Oct 02 18:42:19 [27126] pacemakerd:    error: cfg_connection_destroy:   Connection destroyed
>>> Oct 02 18:42:19 [27126] pacemakerd:   notice: pcmk_shutdown_worker:     Shuting down Pacemaker
>>> Oct 02 18:42:19 [27126] pacemakerd:   notice: stop_child:       Stopping crmd: Sent -15 to process 27177
>>> Oct 02 18:42:19 [27126] pacemakerd:    error: cpg_connection_destroy:   Connection destroyed
>>> Oct 02 18:42:19 [27177]       crmd:     info: crm_signal_dispatch:      Invoking handler for signal 15: Terminated
>>> Oct 02 18:42:19 [27177]       crmd:   notice: crm_shutdown:     Requesting shutdown, upper limit is 1200000ms
>>> Oct 02 18:42:19 [27128] stonith-ng:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: 2
>>> Oct 02 18:42:19 [27177]       crmd:     info: do_shutdown_req:  Sending shutdown request to zd-sol-s1-v61
>>> Oct 02 18:42:19 [27128] stonith-ng:    error: stonith_peer_ais_destroy:         AIS connection terminated
>>> Oct 02 18:42:19 [27128] stonith-ng:     info: stonith_shutdown:         Terminating with  1 clients
>>> Oct 02 18:42:19 [27130]      attrd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: 2
>>> Oct 02 18:42:19 [27130]      attrd:     crit: attrd_ais_destroy:        Lost connection to Corosync service!
>>> Oct 02 18:42:19 [27130]      attrd:   notice: main:     Exiting...
>>> Oct 02 18:42:19 [27130]      attrd:   notice: main:     Disconnecting client 81ffc38, pid=27177...
>>> Oct 02 18:42:19 [27128] stonith-ng:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Oct 02 18:42:19 [27128] stonith-ng:     info: crm_xml_cleanup:  Cleaning up memory from libxml2
>>> Oct 02 18:42:19 [27130]      attrd:    error: attrd_cib_connection_destroy:     Connection to the CIB terminated...
>>> Oct 02 18:42:19 [27127]        cib:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: 2
>>> Oct 02 18:42:19 [27127]        cib:    error: cib_ais_destroy:  Corosync connection lost!  Exiting.
>>> Oct 02 18:42:19 [27129]       lrmd:     info: lrmd_ipc_destroy:         LRMD client disconnecting 807e768 - name: crmd id: 1d659f61-d6e2-4ef3-f674-b9a8ba8029e8
>>> Oct 02 18:42:19 [27127]        cib:     info: terminate_cib:    cib_ais_destroy: Exiting fast...
>>> Oct 02 18:42:19 [27127]        cib:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Oct 02 18:42:19 [27127]        cib:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Oct 02 18:42:19 [27127]        cib:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Oct 02 18:42:19 [27126] pacemakerd:    error: pcmk_child_exit:  Child process attrd exited (pid=27130, rc=1)
>>> Oct 02 18:42:19 [27126] pacemakerd:    error: send_cpg_message:         Sending message via cpg FAILED: (rc=9) Bad handle
>>> Oct 02 18:42:19 [27126] pacemakerd:    error: pcmk_child_exit:  Child process cib exited (pid=27127, rc=64)
>>> Oct 02 18:42:19 [27126] pacemakerd:    error: send_cpg_message:         Sending message via cpg FAILED: (rc=9) Bad handle
>>> Oct 02 18:42:19 [27126] pacemakerd:   notice: pcmk_child_exit:  Child process crmd terminated with signal 13 (pid=27177, core=0)
>>> Oct 02 18:42:19 [27126] pacemakerd:    error: send_cpg_message:         Sending message via cpg FAILED: (rc=9) Bad handle
>>> Oct 02 18:42:19 [27126] pacemakerd:   notice: stop_child:       Stopping pengine: Sent -15 to process 27131
>>> Oct 02 18:42:19 [27126] pacemakerd:     info: pcmk_child_exit:  Child process pengine exited (pid=27131, rc=0)
>>> Oct 02 18:42:19 [27126] pacemakerd:    error: send_cpg_message:         Sending message via cpg FAILED: (rc=9) Bad handle
>>> Oct 02 18:42:19 [27126] pacemakerd:   notice: stop_child:       Stopping lrmd: Sent -15 to process 27129
>>> Oct 02 18:42:19 [27129]       lrmd:     info: crm_signal_dispatch:      Invoking handler for signal 15: Terminated
>>> Oct 02 18:42:19 [27129]       lrmd:     info: lrmd_shutdown:    Terminating with  0 clients
>>> Oct 02 18:42:19 [27129]       lrmd:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Oct 02 18:42:19 [27126] pacemakerd:     info: pcmk_child_exit:  Child process lrmd exited (pid=27129, rc=0)
>>> Oct 02 18:42:19 [27126] pacemakerd:    error: send_cpg_message:         Sending message via cpg FAILED: (rc=9) Bad handle
>>> Oct 02 18:42:19 [27126] pacemakerd:   notice: stop_child:       Stopping stonith-ng: Sent -15 to process 27128
>>> Oct 02 18:42:19 [27126] pacemakerd:   notice: pcmk_child_exit:  Child process stonith-ng terminated with signal 11 (pid=27128, core=128)
>>> Oct 02 18:42:19 [27126] pacemakerd:    error: send_cpg_message:         Sending message via cpg FAILED: (rc=9) Bad handle
>>> Oct 02 18:42:19 [27126] pacemakerd:   notice: pcmk_shutdown_worker:     Shutdown complete
>>> Oct 02 18:42:19 [27126] pacemakerd:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Oct 02 18:42:19 [27126] pacemakerd:     info: main:     Exiting pacemakerd
>>>
>>> Andreas
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@xxxxxxxxxxxxxxxxxxx
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org Getting started:
>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
