Re: can the changing node identity returned by local_get be handled reliably?

Dan,

dan clark wrote:
Hi Jan!

I agree that it is difficult for applications to handle the corosync
daemon changing identity from a bound interface to loopback and then back
to the same interface.

I wonder what the consequence of not using the multicast loop for echoing
back the local message might be for both message delivery guarantees and
timing of message delivery.


None. Basically, how it works today is:
- corosync sends the mcast message
- the kernel processes the message through iptables, loops it back to the sender and sends it to the mcast group
- corosync receives the looped-back message

My idea is to make it work in the following fashion (see the sketch below):
- corosync sends the mcast message
- the kernel processes the message through iptables and sends it to the mcast group
- the corosync message-processing routine is called directly for the local node, instead of depending on the kernel
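
To make that concrete, here is a minimal sketch of the direct local delivery idea. This is not corosync's actual totem code; the socket handling and the deliver_fn/mcast_send names are just illustrative assumptions:

/* Sketch: disable the kernel multicast loop and hand the message to the
 * local receive path directly from the send path. */
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

typedef void (*deliver_fn_t)(const void *msg, size_t len);

static int mcast_send(int sock, const struct sockaddr_in *group,
                      const void *msg, size_t len, deliver_fn_t deliver_fn)
{
    unsigned char loop = 0;

    /* Tell the kernel not to echo our own multicast packets back to us. */
    setsockopt(sock, IPPROTO_IP, IP_MULTICAST_LOOP, &loop, sizeof(loop));

    /* Send to the multicast group for the other nodes. */
    if (sendto(sock, msg, len, 0,
               (const struct sockaddr *)group, sizeof(*group)) < 0)
        return -1;

    /* Call the local message-processing routine directly: no kernel
     * round trip and no dependence on the loopback interface. */
    deliver_fn(msg, len);
    return 0;
}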

First, message delivery considerations.
In the current design, is it the case that the message is delivered to all
receivers in the group if and only if the message is reflected back to the
sender?

I'm talking about multicast delivery. Multicast message delivery itself is not important for ordering. The token is, and the token must still be sent over the wire.

For example, take a precondition of a group of size three, nodes 1, 2, 3,
with a sender on node 1. If a message is sent from node 1 right when a
remote process goes down on node 3, does the message arrival order
guarantee how the message was delivered?
scenario 1: assume 3 got the message
  message reflected,
  group membership changes
scenario 2: assume 3 did not get the message, but 1 & 2 did
  group membership change received
  message reflected
scenario 3: message send fails (doesn't happen?)


In short, corosync implements extended virtual synchrony. In other words, messages are always delivered either to all members, or to the remaining members of the previous membership if the membership is changing.
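
From an application's point of view this shows up through the CPG callbacks: a configuration change delimits which membership a delivery belongs to. A minimal sketch, roughly in the spirit of testcpg.c (error handling mostly omitted):

#include <string.h>
#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid,
                       void *msg, size_t msg_len)
{
    /* Every node still listed in the current configuration delivers this
     * message; nodes that could not receive it are announced as having
     * left by a configuration change delivered before it. */
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
    /* Messages delivered before this callback went to the old membership,
     * messages delivered after it go to the new one. */
}

int main(void)
{
    cpg_handle_t handle;
    cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = deliver_cb,
        .cpg_confchg_fn = confchg_cb,
    };
    struct cpg_name group;

    strcpy(group.value, "GROUP");
    group.length = strlen("GROUP");

    if (cpg_initialize(&handle, &callbacks) != CS_OK)
        return 1;
    if (cpg_join(handle, &group) != CS_OK)
        return 1;
    /* Block and run the callbacks as events arrive. */
    cpg_dispatch(handle, CS_DISPATCH_BLOCKING);
    cpg_finalize(handle);
    return 0;
}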

Second, timing considerations.
Say it would be interesting to avoid the flood of messages which occurs
during high-level debugging of message delivery in corosync. Instead it
might be replaced with a message delivery histogram which counts which
bucket a message gets delivered into (0-64 ms, 64-128 ms, 128-256 ms,
256-1024 ms, 1024-4096 ms, 4096 ms-16 s, > 16 s, not delivered). Could the
sender get a bound on the time for all messages delivered by only timing
the reflected message?
Clearly, simply calling back with the sent message might reduce latency on
the reflection of the message, but would it accurately reflect the
guarantees and message ordering of the current system, or is that already
not a safe assumption?


I believe you've already sent that idea, and it's definitely something to think about.
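
The bucketing itself would be cheap. A minimal sketch of the delivery-latency histogram described above (bucket boundaries taken from the mail; all names are made up for illustration):

#include <stdint.h>

#define DELIVERY_BUCKETS 8   /* 7 latency buckets + "not delivered" */

/* Upper bounds in milliseconds for the first 7 buckets:
 * 0-64, 64-128, 128-256, 256-1024, 1024-4096, 4096-16384, > 16384 */
static const uint64_t bucket_upper_ms[DELIVERY_BUCKETS - 1] = {
    64, 128, 256, 1024, 4096, 16384, UINT64_MAX
};
static uint64_t delivery_histogram[DELIVERY_BUCKETS];

static void record_delivery_latency(uint64_t latency_ms)
{
    int i;

    for (i = 0; i < DELIVERY_BUCKETS - 1; i++) {
        if (latency_ms < bucket_upper_ms[i]) {
            delivery_histogram[i]++;
            return;
        }
    }
}

static void record_not_delivered(void)
{
    delivery_histogram[DELIVERY_BUCKETS - 1]++;
}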

I understand the idea of not rebinding. Does not rebinding allow the local
applications to continue to use their groups internally?

Yes.

How would not rebinding impact nodes coming back together?

No impact.

If an interface changes to a new IP address on a different subnet, how would this be handled?

This case is not handled at all now.

(recommend a full restart?)  How is it detected and reported by corosync?

Probably recommend a full restart in 1.X and 2.X, and put finding a better solution (properly documented, with proper examples of how to handle such situations) on the roadmap for 3.X.

Actually, a change of subnet is currently not detected and reported explicitly at all; you just see messages like "interface is now down" and then "interface 127.0.0.1 is now up". I would like to see a better message.


Thank you for discussing the issues.

I realize now that the modified testcpg.c attachment was scrubbed. Is
there a desired location for the upload?


It was not scrubbed, but I was enjoying PTO ;) I will ACK/NACK/Push soon.

Honza

dan


On Fri, Apr 13, 2012 at 1:51 AM, Jan Friesse <jfriesse@xxxxxxxxxx> wrote:

Dan,
there are two problems I see with the current corosync:
1.) Localhost rebinding
2.) Reliance on the kernel multicast loop facility

My opinion on them is simple. Both must go and must be replaced by:
1.) Don't use the multicast loop. Move the message directly from the send
function to the receive function for the local node.
2.) Never rebind.

It's really impossible for application authors to handle this "change of
identity" behavior.

And solutions for both problems are at the top of my TODO list (so expect
them in 2.1 and backported to flatiron).

Regards,
 Honza

dan clark wrote:

Hi Folks!

Thank you, Christine, for a well-written test application and for leading
the way with the apropos comment "NOTE: in reality we should also check the
nodeid". Some comments are more easily addressed than others!

During various failure tests, a variety of conditions seem to trigger a
client application of the cpg library to change its node identity. I know
this has been discussed under various guises with respect to the proper way
to fail a network (don't ifdown/ifup an interface). Oddly enough, however,
a common dynamic reconfiguration step on a node is to do a 'service network
restart', which tends to do ifdown/ifup on interfaces. Designing
applications to be resilient to common failures is often desirable,
including the restart of a service (such as corosync), so I have included a
slightly modified version of testcpg.c that provides such resiliency. I
wonder, however, if the nature of the changing identity of the node
information returned from cpg_local_get can be relied on across versions,
or if this is abhorrent or transient behaviour that might change? Note that
once the node identity has changed, if an application continues to maintain
use of a group, then once the cluster is reformed that group is isolated
from other groups, despite sharing a common name. Furthermore, there are
impacts on other applications on the isolated node that might share the use
of that group.
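
The shape of that check is essentially to cache the node id reported by cpg_local_get at join time and compare it again later, for example from the configuration-change callback. A minimal sketch (helper names made up for illustration, assuming an already initialized and joined CPG handle):

#include <stdio.h>
#include <corosync/cpg.h>

static uint32_t start_nodeid;   /* cached right after the join */

static void remember_local_nodeid(cpg_handle_t handle)
{
    unsigned int id = 0;

    if (cpg_local_get(handle, &id) == CS_OK)
        start_nodeid = id;
}

static void check_identity(cpg_handle_t handle)
{
    unsigned int id = 0;

    if (cpg_local_get(handle, &id) == CS_OK && id != start_nodeid)
        fprintf(stderr, "node identity changed: was 0x%x, now 0x%x\n",
                start_nodeid, id);
}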

On a separate note, is there a way to change the seemingly fixed 20-30
second delay upon the daemons re-joining a cluster separated due to
isolated network conditions (power cycling a switch for example)?

Note the following output, indicating the first realization by a node that
the configuration has changed, and how local_get indicates that the node
identity is now 127.0.0.1 as opposed to the original value, which was
4.0.0.8. Tests were performed on version 1.4.2.

 dcube-8:28506 2012-04-09 14:02:06.803
ConfchgCallback: group 'GROUP'
Local node id is 127.0.0.1/100007f result 1
left node/pid 4.0.0.2/15796 reason: 3
nodes in group now 3
node/pid 4.0.0.4/2655
node/pid 4.0.0.6/4238
node/pid 4.0.0.8/28506
....

Finally, even though the reported identity is loopback, the original id is
matched due to the static cache from the time of the join. Is there a race
condition, however, such that just after the join, if there is a network
failure, the identity might change before the initialization logic is
complete, and thus even the modified sample program is open to a failure?

dcube-8:28506 2012-04-09 14:02:06.803
ConfchgCallback: group 'GROUP'
Local node id is 127.0.0.1/100007f result 1
left node/pid 4.0.0.8/28506 reason: 3
nodes in group now 0
We might have left the building pid 28506
We probably left the building switched identity? start nodeid 134217732
nodeid 134217732 current nodeid 16777343 pid 28506
We have left the building direct match start nodeid 134217732 nodeid
134217732 local get current nodeid 16777343 pid 28506

Perhaps the test application for the release could be updated to include
appropriate testing for the nodeid?

Dan







_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

