Re: how is failure detection achieved in Corosync?

Jan Friesse <jfriesse@xxxxxxxxxx> · Mon, 15 Apr 2013 09:12:27 +0200

Alejandro Z. Tomsic wrote:
> Honza,
> 
> Thank you for your help.
> Further questions:
> 
> On 10/04/2013, at 15:10, Jan Friesse <jfriesse@xxxxxxxxxx> wrote:
> 
>> Alejandro Z. Tomsic wrote:
>>> I would like to know how failure detection is achieved in Corosync (if at all). I am interested in the implementation details, i.e. whether it is done at the physical, virtual machine, or application level. Does Corosync use any known failure detection mechanism, e.g. [1][2][3][4], or some other one? Where can I find this information?
>>
>> Totem is based on a circulating token, and loss of the token (the token
>> was not delivered within a given time) is used as the failure detector
>> (so a weak detector). Corosync also (optionally) implements heartbeating.
>>
>> For more information, take a look at:
>> https://github.com/corosync/corosync/wiki/Developers#reference-documentation
> 
> Do you think there is a modular way to replace this with different failure detectors? I am interested in comparing and evaluating different mechanisms.

That's a hard question, and it really depends on what you want to achieve. I see two options:

- It should be possible to use totemsrp as the membership service (set the
token loss timeout to infinity) and test various failure detectors
- Implement a failure detector and build a membership service on top of it
(because the upper layers depend on membership information), so basically
rewrite totemsrp

Either way, totemsrp.c is the file you are interested in (and probably
the only one where radical changes are needed). Also, totemsrp.c is NOT
well prepared for such a change, so it will mean a lot of hacking.
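
To illustrate what swappable failure detectors could look like, here is a
minimal Python sketch. It is not Corosync code: the FailureDetector
interface, both detector classes and the membership() helper are
hypothetical names made up for illustration, and totemsrp.c has no such
plug-in point. FixedTimeoutDetector only loosely mirrors a token loss
timeout, and PhiAccrualDetector is a simplified take on the idea in [3]
that assumes exponentially distributed inter-arrival times (a simplifying
assumption of this sketch).

```python
import math
from abc import ABC, abstractmethod


class FailureDetector(ABC):
    """Hypothetical plug-in interface (not part of Corosync)."""

    @abstractmethod
    def heard_from(self, node, now):
        """Record that a message (token or heartbeat) arrived from node."""

    @abstractmethod
    def suspects(self, node, now):
        """Return True if node is currently suspected to have failed."""


class FixedTimeoutDetector(FailureDetector):
    """Suspect a node once nothing has arrived for `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heard_from(self, node, now):
        self.last_seen[node] = now

    def suspects(self, node, now):
        last = self.last_seen.get(node)
        # A node we have never heard from is not suspected: no evidence yet.
        return last is not None and now - last > self.timeout


class PhiAccrualDetector(FailureDetector):
    """Simplified phi accrual detector (exponential inter-arrival model)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.last_seen = {}
        self.intervals = {}

    def heard_from(self, node, now):
        if node in self.last_seen:
            gap = now - self.last_seen[node]
            self.intervals.setdefault(node, []).append(gap)
        self.last_seen[node] = now

    def phi(self, node, now):
        samples = self.intervals.get(node)
        if not samples:
            return 0.0
        mean = sum(samples) / len(samples)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_seen[node]
        # P(next arrival later than elapsed) = exp(-elapsed / mean);
        # phi is -log10 of that probability.
        return elapsed / (mean * math.log(10))

    def suspects(self, node, now):
        return self.phi(node, now) > self.threshold


def membership(detector, nodes, now):
    """Membership view built on top of whichever detector is plugged in."""
    return [n for n in nodes if not detector.suspects(n, now)]
```

Because membership() only talks to the interface, the same message trace
can be replayed against each detector for a comparison. Passing
float("inf") as the timeout of FixedTimeoutDetector gives a detector that
never suspects anyone, which corresponds to the "timeout set to infinity"
case in the first option above.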

Regards,
  Honza

By the way, if you decide to make changes, could you please publish the
results to the corosync ML together with a diff of the source code? It may
be interesting and maybe useful.

> 
> 
>>
>> Especially
>> http://corosync.github.com/corosync/doc/DAAgarwal.thesis.ps.gz should
>> give you the information you need.
>>
>> Regards,
>>  Honza
>>
> 
> Best,
> 
> Alejandro
> 
> 
>>>
>>> Thank you in advance.
>>>
>>> Alejandro
>>>
>>>
>>>
>>>
>>> [1] M. Bertier, O. Marin, and P. Sens. Implementation and performance evaluation of an adaptable failure detector. In International Conference on Dependable Systems and Networks (DSN), pages 354–363, June 2002.
>>>
>>> [2] W. Chen, S. Toueg, and M. K. Aguilera. On the quality of service of failure detectors. IEEE Transactions on Computers, 51(5):561–580, May 2002.
>>>
>>> [3] N. Hayashibara, X. Défago, R. Yared, and T. Katayama. The φ accrual failure detector. In IEEE Symposium on Reliable Distributed Systems (SRDS), pages 66–78, Oct. 2004.
>>>
>>> [4] J. B. Leners, H. Wu, W.-L. Hung, M. K. Aguilera, and M. Walfish. Detecting failures in distributed systems with the Falcon spy network. In ACM Symposium on Operating Systems Principles (SOSP), pages 279–294, 2011.
>>>
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> discuss@xxxxxxxxxxxx
>>> http://lists.corosync.org/mailman/listinfo/discuss
>>
> 
