A node that is shut down is sometimes judged as failed.

Hi All,

We set up a cluster with corosync.
Afterwards we shut down one node.

The node that was shut down is then sometimes judged as failed by the cluster.

---------------------------------------
Oct 21 11:03:30 XXX corosync[21677]: [TOTEM ] A processor failed, forming new configuration.
---------------------------------------

This phenomenon seems to occur with very low probability.

We think it is a problem that a failure is logged for a node that was shut down intentionally.

The cause is that the leave message (memb_leave_message_send), which is sent when a user stops corosync, may be thrown away by the following code in net_deliver_fn():

static int net_deliver_fn (
	int fd,
	int revents,
	void *data)
{
	struct totemudp_instance *instance = (struct totemudp_instance *)data;
	struct msghdr msg_recv;
	struct iovec *iovec;
(snip)
	/*
	 * Drop all non-mcast messages (more specifically join
	 * messages should be dropped)
	 */
	message_type = (char *)iovec->iov_base;
	if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
		/* a leave is also sent as a MEMB_JOIN frame, so it is dropped here too */
		iovec->iov_len = FRAME_SIZE_MAX;
		return (0);
	}
(snip)
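
For context, our understanding is the following (a simplified sketch, not the exact corosync structures; the struct layout, field names, and type value below are illustrative assumptions): memb_leave_message_send() encodes the leave as a frame of type MESSAGE_TYPE_MEMB_JOIN in which the leaving node lists itself in its own failed list, so the filter above discards leaves together with ordinary joins while flushing.

/*
 * Illustrative sketch only -- the real definitions live in totemsrp.c
 * and differ in detail. The point: a "leave" reuses the join message
 * type, with the departing node placed in its own failed list, so a
 * filter that drops MEMB_JOIN frames drops leaves as well.
 */
#include <stdio.h>

#define MESSAGE_TYPE_MEMB_JOIN 3	/* hypothetical value */

struct memb_join_sketch {
	char message_type;		/* MEMB_JOIN for both join and leave */
	unsigned int sender_nodeid;
	unsigned int failed_list_entries;
	unsigned int failed_list[8];	/* a leaving node includes itself here */
};

/* A join frame actually carries a leave when the sender marks itself failed. */
static int is_leave (const struct memb_join_sketch *m)
{
	for (unsigned int i = 0; i < m->failed_list_entries; i++) {
		if (m->failed_list[i] == m->sender_nodeid) {
			return (1);
		}
	}
	return (0);
}

int main (void)
{
	struct memb_join_sketch leave = {
		.message_type = MESSAGE_TYPE_MEMB_JOIN,
		.sender_nodeid = 2,
		.failed_list_entries = 1,
		.failed_list = { 2 },
	};
	printf ("is_leave = %d\n", is_leave (&leave));	/* prints 1 */
	return (0);
}

When the leave frame is dropped, the remaining nodes only notice the departure when the token times out, which produces the "A processor failed" log above.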

We would like the leave to be handled reliably so that a stopped node leaves the cluster cleanly.
Is it possible to fix this handling in corosync?
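
One possible direction, as a rough sketch only (it reuses the hypothetical is_leave() helper above and has not been checked against the real frame layout): let net_deliver_fn() pass a join frame through while flushing when that frame actually carries a leave.

	/*
	 * Hypothetical change to the filter in net_deliver_fn(); a sketch,
	 * not a tested patch. Instead of dropping every MEMB_JOIN while
	 * flushing, keep the ones that encode a leave so the departing node
	 * is removed from the membership cleanly instead of timing out.
	 */
	message_type = (char *)iovec->iov_base;
	if (instance->flushing == 1 &&
	    *message_type == MESSAGE_TYPE_MEMB_JOIN &&
	    is_leave ((const struct memb_join_sketch *)iovec->iov_base) == 0) {
		iovec->iov_len = FRAME_SIZE_MAX;
		return (0);
	}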

We hope that this problem can be fixed in the next version, if possible.


Best Regards,
Hideo Yamauchi.


_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss




