Fwd: Question about message generation/origination during SYNC

jason <huzhijiang@xxxxxxxxx> · Fri, 5 Jun 2015 00:05:15 +0800

Please consider whether the following is a real issue. As I
understand it, events may unfold like this in a two-node cluster
with nodes A and B:

Node A                    Node B
                               1) Send the last SYNC message(seq n)
                               2) Send the token
3) Got token
4) Lost the last SYNC message(seq n)
5) SYNC cannot complete; must wait for the retransmission
6) Originate a cfg message(seq n + 1) based on *trans* assembly buffer
7) Cannot immediately deliver the cfg message (seq n + 1) based on the
trans assembly buffer
8) Send the token
                               9) Got token
                               10) Retransmit the last SYNC message(seq n)
                               11) Send the token
12) Got token
13) Received the last SYNC message(seq n)
14) SYNC done, switch to the normal assembly buffer
15) Deliver cfg message(seq n + 1) based on *normal* assembly buffer! BAD!

In this case, the cfg message (seq n + 1) will corrupt our *normal* assembly buffer.
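
To make the sequence above concrete, here is a minimal Python sketch of
it. This is a toy model, not corosync code: the Node class, its methods,
and run_scenario() are hypothetical names invented for illustration. The
only assumption it encodes is the one described above, namely that the
assembly buffer is chosen by the state at delivery time rather than by
the state at origination time.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for Node A; not a corosync data structure."""
    waiting_trans_ack: bool = True              # True while SYNC is in progress
    trans_assembly: list = field(default_factory=list)
    normal_assembly: list = field(default_factory=list)

    def originate(self, seq, payload):
        # Record which state the node was in when the message was originated.
        origin = "trans" if self.waiting_trans_ack else "normal"
        return {"seq": seq, "payload": payload, "origin": origin}

    def deliver(self, msg):
        # The buffer is picked from the CURRENT state, not from msg["origin"].
        if self.waiting_trans_ack:
            self.trans_assembly.append(msg)
            return "trans"
        self.normal_assembly.append(msg)
        return "normal"


def run_scenario(n=10):
    node_a = Node()
    # Step 6: cfg message originated while SYNC is still pending.
    cfg = node_a.originate(n + 1, "cfg reload")
    # Step 7: delivery is deferred because seq n has not arrived yet.
    # Steps 13-14: seq n arrives, SYNC completes, node switches buffers.
    node_a.waiting_trans_ack = False
    # Step 15: the deferred cfg message is delivered now.
    used = node_a.deliver(cfg)
    return cfg["origin"], used


if __name__ == "__main__":
    origin, used = run_scenario()
    print(f"originated under: {origin}, delivered into: {used}")
```

The mismatch between the two returned values is the problem in step 15:
the message was built for one buffer and is consumed by the other.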

This happens because corosync_sending_allowed() (called by
cs_ipcs_msg_process()) returns QB_TRUE for IPC connections that are
CS_LIB_FLOW_CONTROL_NOT_REQUIRED, such as cfg. A cfg message can
therefore still pass from IPC to totem during SYNC, which allows
step 6) to happen.

One straightforward way to solve this issue is to change every
CS_LIB_FLOW_CONTROL_NOT_REQUIRED to CS_LIB_FLOW_CONTROL_REQUIRED for
the cfg service.
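
The gating behaviour and the effect of the proposed change can be
sketched as follows. This is a simplified model in Python, not the real
corosync_sending_allowed(): the function sending_allowed() and the
FlowControl enum are hypothetical, and I am assuming that a
flow-controlled connection is simply held back while SYNC is in
progress. The real function may take other conditions into account.

```python
from enum import Enum


class FlowControl(Enum):
    # Values mirror the flag names from the discussion; the enum itself
    # is a hypothetical illustration, not a corosync type.
    REQUIRED = "CS_LIB_FLOW_CONTROL_REQUIRED"
    NOT_REQUIRED = "CS_LIB_FLOW_CONTROL_NOT_REQUIRED"


def sending_allowed(flow_control, sync_in_progress):
    """Simplified model of the IPC gate (assumption, see lead-in)."""
    if flow_control is FlowControl.NOT_REQUIRED:
        # Current cfg behaviour: always allowed, even during SYNC.
        return True
    # Flow-controlled connections wait until SYNC has finished.
    return not sync_in_progress


if __name__ == "__main__":
    for flag in FlowControl:
        for sync in (True, False):
            print(flag.value, "sync" if sync else "idle",
                  sending_allowed(flag, sync))
```

Under this model, marking cfg as REQUIRED closes the window in which
step 6) can occur, at the cost of delaying cfg requests until SYNC is
done.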

---------- Forwarded message ----------
From: jason <huzhijiang@xxxxxxxxx>
Date: Fri, Jan 30, 2015 at 7:29 PM
Subject: Question about message generation/origination during SYNC
To: "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>

Dear All,

By analyzing the current corosync code, I found that some messages can
be generated from a library during SYNC processing (such as
reload/shutdown via cfgtool, because they are
CS_LIB_FLOW_CONTROL_NOT_REQUIRED). In other words, they can be queued
on new_message_queue_trans because instance->waiting_trans_ack is set
to 1, and they may then be originated after the last SYNC message. In
that situation they are delivered after instance->waiting_trans_ack
and totempg_waiting_transack have been set back to 0, so the assembly
for normal messages is used to defragment them instead of the expected
assembly for trans messages. This may ultimately result in lost normal
messages, because the fragment number does not equal the assembly's
last_frag_num.
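
A rough illustration of how such a mismatch could lose a normal message
is sketched below in Python. It is a toy model and not the totempg
implementation: the Assembly class, the (continuation, fragmented)
packet fields, and the rule "drop on mismatch" are my simplifying
assumptions; only the name last_frag_num comes from the code discussed
above.

```python
class Assembly:
    """Toy reassembly buffer (hypothetical, not the totempg structure)."""

    def __init__(self):
        self.last_frag_num = 0   # fragment number the next packet must continue
        self.partial = b""       # bytes of the message being reassembled
        self.delivered = []      # fully reassembled messages

    def receive(self, continuation, fragmented, data):
        # continuation: fragment number this packet continues (0 = new message)
        # fragmented:   fragment number the next packet must carry (0 = complete)
        if continuation != self.last_frag_num:
            # Mismatch: whatever was partially assembled is lost.
            self.partial = b""
            self.last_frag_num = 0
            if continuation != 0:
                # An orphaned continuation cannot be used either.
                return
        self.partial += data
        if fragmented == 0:
            self.delivered.append(self.partial)
            self.partial = b""
        self.last_frag_num = fragmented


def run(misrouted):
    normal, trans = Assembly(), Assembly()
    # First half of a two-fragment normal message.
    normal.receive(0, 3, b"normal-part1-")
    # A complete cfg message that was originated on the trans queue.
    target = normal if misrouted else trans
    target.receive(0, 0, b"cfg")
    # Second half of the normal message.
    normal.receive(3, 0, b"part2")
    return normal.delivered


if __name__ == "__main__":
    print("correct routing:", run(misrouted=False))
    print("misrouted cfg:  ", run(misrouted=True))
```

With correct routing the normal message is reassembled in full. When
the cfg message lands in the normal assembly, the half-built normal
message is discarded and its second fragment no longer matches
last_frag_num, so the normal message is never delivered.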

Please take a look: is this really a problem, or have I missed something?

Thank you!

--
Yours,
Jason
