Re: pacemaker "CPG API: failed Library error"

On 10/02/14 15:55, Jan Friesse wrote:
Alessandro Bono wrote:
On 10/02/14 13:55, Jan Friesse wrote:
Alessandro Bono wrote:
On 10/02/14 12:24, Jan Friesse wrote:
Alessandro Bono wrote:
On 10/02/14 10:47, Jan Friesse wrote:
Alessandro,
can you find a message like "Corosync main process was not scheduled
for ... ms" in the log file (corosync must be at least 1.4.1-16, so
CentOS 6.5)?
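
One way to check, as a sketch (assuming the usual cman-era log location;
adjust the path to your logging setup):

    grep "was not scheduled" /var/log/cluster/corosync.log
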
Hi

there is no message like that in the log file
the distro is CentOS 6.5
OK. So the first thing to try is to remove the redundant ring (just
remove the altname tags) and see if the problem still exists. If so,
give standard multicast a try (so remove udpu), but make sure to enable
multicast_querier (echo 1 >
/sys/class/net/$NETWORK_IFACE/bridge/multicast_querier); I'm using a
libvirt qemu hook (/etc/libvirt/hooks/qemu) for that, sketched below.
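
A minimal sketch of such a hook (the bridge name br0 is an assumption;
replace it with whatever bridge your guests are attached to):

    #!/bin/sh
    # /etc/libvirt/hooks/qemu -- libvirt invokes this as:
    #   qemu <guest_name> <operation> <sub-operation> ...
    BRIDGE_IFACE="br0"   # assumption: the bridge carrying cluster traffic

    if [ "$2" = "started" ]; then
        # Enable the IGMP querier on the bridge so multicast group
        # membership stays refreshed and corosync multicast keeps flowing.
        echo 1 > "/sys/class/net/${BRIDGE_IFACE}/bridge/multicast_querier"
    fi
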
The redundant ring and udpu are an attempt to work around the problem;
this library error was first seen on a configuration without these
parameters.
I have the same problem on two clusters with a similar node configuration.
OK. Can you please then paste logs from a single-ring multicast
configuration (ideally CentOS 6.5)?
I'd have to find some old logs on my tape backup, which isn't easy right now.
I have a full debug file, but it's for CentOS 6.4 and it's 164k; I'll send
it to you off-list.
OK, but I would still really like to see a log from 6.5 (there was a huge
amount of fixes for 6.5).
I don't have a log from CentOS 6.5; I reconfigured the cluster as you requested:

<cluster config_version="8" name="ga-ext_cluster">
  <logging>
    <logging_daemon name="corosync" debug="on"/>
  </logging>
  <clusternodes>
    <clusternode name="ga1-ext" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="ga1-ext"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="ga2-ext" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="ga2-ext"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
</cluster>


Tonight I'll force a full backup to trigger the corosync error, and I'll send you the log file.

But in the log I've seen no bigger problem. I mean, there was one split,
probably caused by the fact that the VM was not scheduled (22:26:25).
Other than that, the log looks quite OK. There is really no universal
solution for the VM-not-scheduled scenario. You can lower the priority of
the backup script / raise the priority of the VM, pin the VM to a CPU, ...
but it may or may not help. Some illustrative commands follow.
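
Hypothetical examples of those knobs (domain and script names are
placeholders, not taken from your setup):

    # Deprioritize the backup job (CPU nice and idle-class disk I/O):
    nice -n 19 ionice -c 3 /usr/local/bin/full-backup.sh

    # Give the guest more CPU weight on the host:
    virsh schedinfo ga1-ext-vm --set cpu_shares=2048

    # Pin the guest's vCPU 0 to host CPU 2:
    virsh vcpupin ga1-ext-vm 0 2
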
OK, but on the same machines there is another cluster of old Ubuntu VMs
running corosync 1.4.2, and it's rock solid; also, prior to the switch to
cman, the cluster worked perfectly.
As a workaround, maybe I'll put the cluster in maintenance mode and stop
pacemaker before starting the backup (along the lines of the sketch
below), but it's not a nice thing.
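
A rough sketch of that wrapper (untested; the backup path is a placeholder):

    crm configure property maintenance-mode=true   # resources stay up, unmanaged
    service pacemaker stop                         # pacemaker can't react to the load
    /usr/local/bin/full-backup.sh                  # placeholder for the real backup job
    service pacemaker start
    crm configure property maintenance-mode=false
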
Honza

Regards,
    Honza

Regards,
     Honza

rpm -qa corosync
corosync-1.4.1-17.el6.x86_64

Regards,
      Honza

Alessandro Bono wrote:
Hi

after changing the cluster from corosync to cman+corosync (switching
from CentOS 6.3 to 6.4) I have a recurring problem with
pacemaker/corosync.
pacemaker reports this error:

pacemakerd:    error: pcmk_cpg_dispatch:     Connection to the CPG API failed: Library error (2)

and then shuts itself down.
This normally happens when the host machine is under high load, for
example during a full backup.

In addition, there are a lot of these messages:

Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active

I reported this problem to the pacemaker ML, but they said it's a
corosync problem.
Same problem with CentOS 6.5.

I tried to switch communication to udpu and add another communication
path, but without any luck.
The cluster nodes are KVM virtual machines.

Is it a configuration problem?

Some info below; I can provide the full log if necessary.

rpm -qa  | egrep "pacem|coro"| sort
corosync-1.4.1-17.el6.x86_64
corosynclib-1.4.1-17.el6.x86_64
drbd-pacemaker-8.3.16-1.el6.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64
pacemaker-cli-1.1.10-14.el6_5.2.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
pacemaker-debuginfo-1.1.10-1.el6.x86_64
pacemaker-libs-1.1.10-14.el6_5.2.x86_64


cat /etc/cluster/cluster.conf
<cluster config_version="8" name="ga-ext_cluster">
      <cman transport="udpu"/>
      <logging>
       <logging_daemon name="corosync" debug="on"/>
      </logging>
      <clusternodes>
        <clusternode name="ga1-ext" nodeid="1">
          <fence>
            <method name="pcmk-redirect">
              <device name="pcmk" port="ga1-ext"/>
            </method>
          </fence>
          <altname name="ga1-ext_alt"/>
        </clusternode>
        <clusternode name="ga2-ext" nodeid="2">
          <fence>
            <method name="pcmk-redirect">
              <device name="pcmk" port="ga2-ext"/>
            </method>
          </fence>
          <altname name="ga2-ext_alt"/>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <fencedevice agent="fence_pcmk" name="pcmk"/>
      </fencedevices>
</cluster>

crm configure show
node ga1-ext \
        attributes standby="off"
node ga2-ext \
        attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr \
        params ip="10.12.23.3" cidr_netmask="24" \
        op monitor interval="30s"
primitive SharedFS ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
primitive dovecot lsb:dovecot
primitive drbd0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="15s"
primitive drbdlinks ocf:tummy:drbdlinks
primitive mail ocf:heartbeat:MailTo \
        params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
primitive mysql lsb:mysqld
group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
        meta target-role="Started"
ms ms_drbd0 drbd0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation service_on_drbd inf: service_group ms_drbd0:Master
order service_after_drbd inf: ms_drbd0:promote service_group:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6_5.2-368c726" \
        cluster-infrastructure="cman" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1391290945" \
        maintenance-mode="false"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

Extract from cluster.log:

Feb 01 21:40:15 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 21:40:15 [13253] ga1-ext        cib:     info: crm_cs_flush:     Sent 4 CPG messages  (0 remaining, last=48): OK (1)
Feb 01 21:40:15 [13256] ga1-ext       crmd:     info: crm_cs_flush:     Sent 3 CPG messages  (0 remaining, last=24): OK (1)
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning: crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning: crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 corosync [CMAN  ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
Feb 01 21:40:17 corosync [CMAN  ] memb: Message on port 0 is 6
Feb 01 21:40:17 corosync [CMAN  ] memb: got KILL for node 1
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning: crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning: crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=join_offer)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: do_state_transition:      State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: update_dc:        Unset DC. Was ga1-ext
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:18 [13247] ga1-ext pacemakerd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13247] ga1-ext pacemakerd:    error: mcp_cpg_destroy:  Connection destroyed
Feb 01 21:40:18 [13247] ga1-ext pacemakerd:     info: crm_xml_cleanup:  Cleaning up memory from libxml2
Feb 01 21:40:18 [13255] ga1-ext      attrd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13255] ga1-ext      attrd:     crit: attrd_cs_destroy:         Lost connection to Corosync service!
Feb 01 21:40:18 [13255] ga1-ext      attrd:   notice: main:     Exiting...
Feb 01 21:40:18 [13255] ga1-ext      attrd:   notice: main:     Disconnecting client 0x238ff10, pid=13256...
Feb 01 21:40:18 [13255] ga1-ext      attrd:    error: attrd_cib_connection_destroy:     Connection to the CIB terminated...
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: stonith_shutdown:         Terminating with  1 clients
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: cib_connection_destroy:   Connection to the CIB closed.
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: main:     Done
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: crm_xml_cleanup:  Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext       crmd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13256] ga1-ext       crmd:    error: crmd_cs_destroy:  connection terminated
Feb 01 21:40:18 [13256] ga1-ext       crmd:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13253] ga1-ext        cib:    error: cib_cs_destroy:   Corosync connection lost!  Exiting.
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: terminate_cib:    cib_cs_destroy: Exiting fast...
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: crm_xml_cleanup:  Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext       crmd:     info: tengine_stonith_connection_destroy:       Fencing daemon disconnected
Feb 01 21:40:18 [13256] ga1-ext       crmd:   notice: crmd_exit:        Forcing immediate exit: Link has been severed (67)
Feb 01 21:40:18 [13256] ga1-ext       crmd:     info: crm_xml_cleanup:  Cleaning up memory from libxml2
Feb 01 21:40:18 [25258] ga1-ext       lrmd:     info: cancel_recurring_action:  Cancelling operation ClusterIP_monitor_30000
Feb 01 21:40:18 [25258] ga1-ext       lrmd:  warning: qb_ipcs_event_sendv:      new_event_notification (25258-13256-6): Bad file descriptor (9)
Feb 01 21:40:18 [25258] ga1-ext       lrmd:  warning: send_client_notify:       Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
Feb 01 21:40:18 [25258] ga1-ext       lrmd:     info: crm_client_destroy:       Destroying 1 events
Feb 01 21:40:18 [25260] ga1-ext    pengine:     info: crm_client_destroy:       Destroying 0 events


--
Kind regards

Alessandro Bono

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



