Re: SAFE delivery feature request

On 01/18/2014 12:16 PM, andrei.elkin@xxxxxxxxxx wrote:
Hello.

As a new member of the mailing list, let me start by thanking you
for this great piece of software!

Unfortunately, the unimplemented CPG_TYPE_SAFE delivery guarantee seriously
deters State Machine Replication projects, such as database replication,
from utilizing this communication mechanism.
A use case that shows the danger must be well known. Still, it seems worth
describing here; perhaps I will learn what workarounds people have found.
Implementing SAFE in totemsrp is dead simple. Implementing SAFE in totempg (the fragmentation and assembly layer) plus cpg is much more difficult. Another problem is that there is no way to verify that the IPC delivery queue has actually delivered a message; that queue is tied into the implementation of Totem.

In the past, when I have said implementing SAFE is easy, I meant the totemsrp.c codebase. It is probably fewer than 10 lines of code change. The hardest part is dealing with configuration changes.

I'm not sure that implementing SAFE at that level would actually give you what you want with cpg.

What would be handy for totem to have is a CPG that avoids the totempg layer entirely (and limits message sizes to MTU) so that applications could indeed utilize SAFE guarantees correctly. The apps themselves would have to be responsible for handling fragmentation and assembly though, which is how most modern applications of Totem work outside of the Corosync universe.
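The application-level fragmentation and assembly described above could look roughly like the following. This is a minimal sketch, not Corosync code: the header layout, MTU constant, and function names are invented for illustration. The idea is that each fragment stays under the MTU, so each one could be multicast as a single message (e.g. via cpg_mcast_joined() with a SAFE guarantee, were it implemented), and the receiver reassembles in delivery order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FRAG_MTU 1400            /* payload bytes per fragment (assumed) */

/* Invented per-fragment header; real code would also need endianness
 * handling and sender identification. */
struct frag_header {
    uint32_t msg_id;             /* identifies the original message */
    uint32_t seq;                /* fragment index, 0-based         */
    uint32_t total;              /* total fragments in this message */
    uint32_t len;                /* payload bytes in this fragment  */
};

/* Split buf into sub-MTU fragments (header + payload, each malloc'd).
 * Returns the fragment count, or 0 if max_out is too small. */
size_t fragment(uint32_t msg_id, const char *buf, size_t buflen,
                char **out, size_t max_out)
{
    size_t total = (buflen + FRAG_MTU - 1) / FRAG_MTU;
    if (total == 0 || total > max_out)
        return 0;
    for (size_t i = 0; i < total; i++) {
        size_t off = i * FRAG_MTU;
        uint32_t len = (uint32_t)(buflen - off < FRAG_MTU ? buflen - off
                                                          : FRAG_MTU);
        struct frag_header h = { msg_id, (uint32_t)i, (uint32_t)total, len };
        out[i] = malloc(sizeof h + len);
        memcpy(out[i], &h, sizeof h);
        memcpy(out[i] + sizeof h, buf + off, len);
    }
    return total;
}

/* Reassemble fragments into dst, relying on agreed/safe ordering so
 * fragments arrive in sequence. Returns the total payload length. */
size_t reassemble(char *const *frags, size_t nfrags, char *dst)
{
    size_t off = 0;
    for (size_t i = 0; i < nfrags; i++) {
        struct frag_header h;
        memcpy(&h, frags[i], sizeof h);
        assert(h.seq == (uint32_t)i && h.total == (uint32_t)nfrags);
        memcpy(dst + off, frags[i] + sizeof h, h.len);
        off += h.len;
    }
    return off;
}
```

Because Totem's agreed ordering already guarantees that all fragments of one sender's message are delivered in sequence everywhere, the receiver needs no reordering buffer, only concatenation.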

Regards
-steve

   Suppose the cluster consists of three nodes N1, N2 and N3.
   By some time they all delivered (totally ordered) k-1 messages.
   Suppose at that point N1 sends out its message and at once the ring splits
   into N1 and N2+N3 subrings so that the N1's message gets lost for N2+N3.
   With CPG_TYPE_AGREED, the only available delivery semantics, N1 may
   deliver (order) the message as m_k, so the application instance on N1
   will process it to change its state; let's denote that formally as

       N1.state = apply(m_k).

   The N2 + N3 application state would still correspond to the m_k-1 message.
   But if they take over the cluster role at once, which they may since
   they form a majority of the former membership, the first message they
   deliver could make their states inconsistent with that of N1, because

      N2.state = apply(m_k'), m_k' != m_k.

   Notice that the inconsistency generally can't be mended by exchanging
   m_k' and m_k when N1 meets N2+N3 again in a common configuration.
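The last point can be made concrete with a minimal sketch (not Corosync code): when apply() is not commutative, as most real replication operations are not, replaying the missed messages after a remerge leaves the two sides applying the same messages in different orders, and hence in different states. The apply() function below is an invented stand-in.

```c
#include <assert.h>
#include <stdint.h>

/* Invented stand-in for a state machine transition. It is deliberately
 * non-commutative: apply(apply(s, a), b) != apply(apply(s, b), a). */
static uint32_t apply(uint32_t state, uint32_t msg)
{
    return state * 31u + msg;
}

/* Replay of the split-brain scenario: N1 delivered m_k before the split,
 * N2/N3 delivered m_k' instead, and after remerging each side applies
 * the message it missed. */
static void divergence_demo(void)
{
    uint32_t s0  = 1;  /* common state after m_(k-1) */
    uint32_t mk  = 2;  /* the message N1 delivered as m_k   */
    uint32_t mk2 = 3;  /* the message N2+N3 delivered, m_k' */

    uint32_t n1 = apply(apply(s0, mk),  mk2);  /* N1: m_k then m_k'   */
    uint32_t n2 = apply(apply(s0, mk2), mk);   /* N2/N3: m_k' then m_k */

    assert(n1 != n2);  /* states remain inconsistent */
}
```

With SAFE delivery, N1 would never have applied m_k in the first place unless every member of the configuration had received it, so this divergence could not arise.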

So, to summarize the description: any quorum-based solution for cluster role
takeover generally can't work under these semantics. For instance, database
replication is rendered infeasible.

As for workarounds, there is just one that I can see:
when the totem ring configuration changes as above, the cluster service
should be deferred until N1 is back.
I don't think that can be counted as universal. At the same time,
implementing SAFE delivery should not really be a challenging task,
according to my reading of the Totem protocol, as well as to a mail I found:

   From sdake at redhat.com  Sun Mar 11 20:18:51 2012
   From: sdake at redhat.com (Steven Dake)
   Date: Sun, 11 Mar 2012 13:18:51 -0700
   Subject:  [PATCH] drop evs service
   In-Reply-To: <1331449088-28169-1-git-send-email-fdinitto@xxxxxxxxxx>
   References: <1331449088-28169-1-git-send-email-fdinitto@xxxxxxxxxx>
   Message-ID: <4F5D08AB.1010202@xxxxxxxxxx>

   Ugh
   On 03/10/2012 11:58 PM, Fabio M. Di Nitto wrote:
   > From: "Fabio M. Di Nitto" <fdinitto at redhat.com>
   >
   > there are several reasons for this:
   >
   > 1) evs is only partially implemented with no plans to complete it
   >
   > typedef enum {
   >        EVS_TYPE_UNORDERED, /* not implemented */
   >        EVS_TYPE_FIFO,          /* same as agreed */
   >        EVS_TYPE_AGREED,
   >        EVS_TYPE_SAFE           /* not implemented */
   > } evs_guarantee_t;
   >

   We should implement safe at some point - its pretty easy to do.

With best wishes,

Andrei
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
