Re: pacemaker "CPG API: failed Library error"

Alessandro Bono <alessandro.bono@xxxxxxxxx> · Wed, 12 Feb 2014 15:19:18 +0100

On 12/02/14 14:54, Jan Friesse wrote:
Alessandro Bono wrote:
On 12/02/14 14:15, Jan Friesse wrote:
Hi Alessandro,
I was looking at the log file and it looks like it starts right after
the token was lost. Do you have the log from BEFORE that happened?
Hi Honza

I have the previous log file, but something did not work correctly:

[root@ga1-ext ~]# ls -alF  /var/log/cluster/corosync.log-2014021*
-rw-rw---- 1 hacluster haclient  28963 Feb 10 03:32 /var/log/cluster/corosync.log-20140210.gz
-rw-rw---- 1 hacluster haclient 169899 Feb 11 03:43 /var/log/cluster/corosync.log-20140211.gz
-rw-rw---- 1 hacluster haclient  35449 Feb 12 03:24 /var/log/cluster/corosync.log-20140212.gz

[root@ga1-ext ~]# zcat /var/log/cluster/corosync.log-20140211.gz |tail
Feb 10 21:47:02 [2248] ga1-ext        cib:     info: crm_client_new: Connecting 0x280a140 for uid=0 gid=0 pid=2169 id=308fdcf4-f47b-4afb-96f9-126601ff2573
Feb 10 21:47:02 [2248] ga1-ext        cib:     info: cib_process_request:     Completed cib_query operation for section 'all': OK (rc=0, origin=local/crm_resource/2, version=0.309.18)
Feb 10 21:47:02 [2248] ga1-ext        cib:     info: cib_process_request:     Forwarding cib_delete operation for section constraints to master (origin=local/crm_resource/3)
Feb 10 21:47:02 [2248] ga1-ext        cib:     info: cib_process_request:     Completed cib_apply_diff operation for section constraints: OK (rc=0, origin=ga2-ext/crm_resource/3, version=0.310.1)
Feb 10 21:47:02 [2249] ga1-ext stonith-ng:     info: update_cib_stonith_devices:     Updating device list from the cib: new location constraint
Feb 10 21:47:02 [2249] ga1-ext stonith-ng:   notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 10 21:47:02 [2248] ga1-ext        cib:     info: crm_client_destroy:     Destroying 0 events
Feb 10 21:47:02 [2248] ga1-ext        cib:     info: write_cib_contents:     Archived previous version as /var/lib/heartbeat/crm/cib-45.raw
Feb 10 21:47:02 [2248] ga1-ext        cib:     info: write_cib_contents:     Wrote version 0.310.0 of the CIB to disk (digest: 7d257caa36077afded22a7b5b47e27e5)
Feb 10 21:47:02 [2248] ga1-ext        cib:     info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.nAfCBw (digest: /var/lib/heartbeat/crm/cib.Twax6l)

[root@ga1-ext ~]# zcat /var/log/cluster/corosync.log-20140212.gz |head
Feb 11 23:26:01 corosync [TOTEM ] The token was lost in the OPERATIONAL state.
Feb 11 23:26:01 corosync [TOTEM ] A processor failed, forming new configuration.
Feb 11 23:26:01 corosync [TOTEM ] Receive multicast socket recv buffer size (249856 bytes).
Feb 11 23:26:01 corosync [TOTEM ] Transmit multicast socket send buffer size (249856 bytes).
Feb 11 23:26:01 corosync [TOTEM ] Local receive multicast loop socket recv buffer size (249856 bytes).
Feb 11 23:26:01 corosync [TOTEM ] Local transmit multicast loop socket send buffer size (249856 bytes).
Feb 11 23:26:01 corosync [TOTEM ] entering GATHER state from 2.
Feb 11 23:26:03 corosync [TOTEM ] entering GATHER state from 0.
Feb 11 23:26:03 corosync [TOTEM ] Creating commit token because I am the rep.
Feb 11 23:26:03 corosync [TOTEM ] Saving state aru 14a high seq received 14a

That is strange.
Tonight I can run another full backup and resend you the log files.

Anyway, try increasing the token timeout to a value like 10 seconds. It
looks like you have 2 nodes, and the default token timeout there is
1 second; 10 seconds is used for 3 or more nodes. I'm also not sure
whether this changed between 6.3 and 6.4.

Just use

<totem token="X" consensus="X + 2000" />

where X is something like 10000 (milliseconds).
OK, I'll try this config change.
In my corosync.conf from the CentOS 6.3 era I have:
token: 3000
consensus: 5000

OK. This explains everything. Cman will OVERWRITE all config options
from corosync.conf (or actually, corosync.conf is not read at all if
cman is used), so use <totem token="X" consensus="X + 2000" /> and set X
to 3000 and you should be fine.
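
For illustration only, a minimal sketch of how that might look in
/etc/cluster/cluster.conf with X set to 3000, so that consensus becomes
3000 + 2000 = 5000 (both values are in milliseconds). The cluster name
and config_version here are hypothetical placeholders, and the rest of
the file is left out:

```xml
<!-- Sketch only: name and config_version are made-up placeholders -->
<cluster name="example-cluster" config_version="2">
  <!-- token = X = 3000 ms, consensus = X + 2000 = 5000 ms -->
  <totem token="3000" consensus="5000"/>
  <!-- existing clusternodes, fencedevices, etc. stay unchanged here -->
</cluster>
```

These are the same numbers as the token: 3000 / consensus: 5000 pair
from the old corosync.conf, just expressed where cman reads them.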

I know corosync.conf is not used now.
I remember lots of problems the first time I converted a cluster to cman
because corosync.conf was present.
I renamed corosync.conf to corosync.conf.old just for situations
like this :-)

Regards,
   Honza

Regards,
    Honza

Alessandro Bono wrote:
On 10/02/14 15:55, Jan Friesse wrote:
OK, but I would still really like to see a log from 6.5 (there was a huge
number of fixes in 6.5).
Hi Honza

I found time to force a full backup.
Note that I forced the full backup on another VM with lots of data; this
is enough to stress the host server and cause errors on the cluster.
Attached is the zipped log file from CentOS 6.5.

thank you

--
Best regards

Alessandro Bono

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss