Re: Question: Duration of DC election

Andrew Beekhof <andrew@xxxxxxxxxxx> · Mon, 31 Aug 2015 11:52:37 +1000

> On 28 Aug 2015, at 8:00 pm, Stefan Wenk <Stefan.wenk@xxxxxxx> wrote:
> 
> Hi Andrew,
> 
> thank you for the answers!
> 
>> There _should_ be an entry for “time to fence the peer”, but based on your reported times I’m assuming you’ve turned that off.
> There is no such entry in the log, but you are right, I turned fencing off. Would active fencing decrease the overall downtime?

No, it would arguably increase it.
It would however prevent corruption and ensure the service is only being made available on one node at a time.

> 
>> I’d be interested to know what log messages you’re basing your timing numbers on.
> Until yesterday, the timings numbers were only based on measuring on application level. I'm using sipp client scripts and capture the SIP traffic with tcpdump into a file and perform analysis with wireshark. As this is not so easy to explain and requires at least basic SIP application level understanding, I tried to improve the corosync/pacemaker logging and succeeded by using rsyslog with high precision timestamps enabled and tuning of  PCMK_* variables in /etc/sysconfig/pacemaker. Therefore I can now provide logging entries. 
> 
> The logging below is taken on the node, which takes over when the power-loss scenario on the other active node is happening.
> 
> This line seems to be the first log entry, when the detection of the lost node is noticed:
> (1) 2015-08-27T09:22:33.558245+02:00 pc-q2 crmd[18187]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node pc-q1[16843022] - state is now lost (was member)
> 
> After this log entry there are couple of entries for the processes you explained below. I don't think that they are important. It gets interesting 107ms later here:
> 
> (2) 2015-08-27T09:22:33.665932+02:00 pc-q2 crmd[18187]:  warning: do_log: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION

So thats the election completed

> (3) 2015-08-27T09:22:34.804145+02:00 pc-q2 pengine[18186]:   notice: update_validation: pacemaker-1.2-style configuration is also valid for pacemaker-1.3
> ..
> (4) 2015-08-27T09:22:34.831664+02:00 pc-q2 crmd[18187]:   notice: te_rsc_command: Initiating action 5: start resSIPIP_start_0 on pc-q2 (local)
> 
> It looks as the policy engine  needs 1139ms to make its decision. This is the time difference between the log of line (2) and line (3).

No, the PE doesn’t get started until step 3.
The 2-3 time delta involves making sure the cib is accurate and sending it to the PE.

3-4 is the better approximation and you should also look for when the start completes.

> Is such a quite long duration expected here? Are there any tuning possibilities or do you need more detailed logging information. It looks as it is possible to accomplish more detailed loggings by setting other PCMK variables, e.g. PCMK_trace_tags, but I was not able to figure out useful settings for my case.

PCMK_trace_functions is a better bet.

> 
> Thank you,
> 
> Stefan
> 
> 
> 
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew@xxxxxxxxxxx] 
>> Sent: Donnerstag, 27. August 2015 05:06
>> To: Stefan Wenk <Stefan.wenk@xxxxxxx>
>> Cc: discuss@xxxxxxxxxxxx; Cluster Labs - All topics related to open-source clustering welcomed <users@xxxxxxxxxxxxxxx>
>> Subject: Re:  Question: Duration of DC election
>> 
>> 
>>> On 25 Aug 2015, at 7:46 pm, Stefan Wenk <Stefan.wenk@xxxxxxx> wrote:
>>> 
>>> Hi,
>>> 
>>> I'm performing downtime measurement tests using corosync version 2.3.0 and pacemaker version 1.1.12  under RHEL 6.5 MRG and although not recommended, I tuned the corosync configuration settings to following insane values:
>>> 
>>>       # Timeout for token
>>>       token: 60
>>>       token_retransmits_before_loss_const: 1
>>> 
>>>       # How long to wait for join messages in the membership protocol (ms)
>>>       join: 35
>>>       consensus: 70
>>> 
>>> My two node cluster consists of a kamailio clone resource, which replicates the so called userlocation state using DMQ on application level (see [1]). The switchover performs the migration of a ocf:heartbeat:IPaddr2 resource. With these settings, the service downtime is lower 100ms in case of a controlled cluster switchover, when "/etc/init.d/pacemaker stop" and "/etc/init.d/corosync stop" get executed. 
>>> 
>>> The service downtime is about 400ms when the power loss is simulated on the active node, which does not execute the DC task. When I simulate power loss on the active node, which is active and executes the DC task, the service downtime increases to about 1500ms. As the timestamps in the logs are on second resolution only, it is hard to provide more detailed numbers, but apparently the DC election procedure takes more than 1000ms.
>>> 
>>> Are there any possibilities to tune the DC election process? Is there documentation available what is happening in this situation?
>>> 
>>> Tests with more nodes in the cluster showed that the service downtime increases with the number of online cluster nodes, even if the DC is executed on one of the nodes, which remain active. 
>> 
>> When there is only 2 nodes, then there is effectively no election happening and the delay is made up of:
>> - corosync detection time
>> - time for the crmd to send a message to itself via corosync
>> - time for the policy engine to figure out where to put the service
>> - time for the start action of your service(s) to execute
>> 
>> There _should_ be an entry for “time to fence the peer”, but based on your reported times I’m assuming you’ve turned that off.
>> 
>> As the node count goes up, elections need to start happening for real (so you need to hear from everyone and have them all agree on a winner) but still it should be pretty quick.
>> The policy engine will take incrementally longer because it has more nodes to loop through, but that should be negligible on the scale that corosync can operate at.
>> 
>> I’d be interested to know what log messages you’re basing your timing numbers on.
>> 
>>> 
>>> I'm using one ring only. It looks as the usage of two rings do not change the test results a lot.
>>> 
>>> Thank you,
>>> 
>>> Stefan
>>> 
>>> [1] http://kamailio.org/docs/modules/devel/modules/dmq.html
>>> 
>>> _______________________________________________
>>> discuss mailing list
>>> discuss@xxxxxxxxxxxx
>>> http://lists.corosync.org/mailman/listinfo/discuss
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss