Please consider the possibility of this potential issue. As I understand it, the following can happen in a two-node cluster with Node A and Node B:

Node A:
   1) Send the last SYNC message (seq n)
   2) Send the token
Node B:
   3) Got token
   4) Lost the last SYNC message (seq n)
   5) SYNC cannot be done, need to wait for retransmit
   6) Originate a cfg message (seq n + 1) based on the *trans* assembly buffer
   7) Cannot immediately deliver the cfg message (seq n + 1) based on the trans assembly buffer
   8) Send the token
Node A:
   9) Got token
   10) Retransmit the last SYNC message (seq n)
   11) Send the token
Node B:
   12) Got token
   13) Received the last SYNC message (seq n)
   14) SYNC done, switch to the normal assembly buffer
   15) Deliver the cfg message (seq n + 1) based on the *normal* assembly buffer! BAD!

In this case, the cfg message (seq n + 1) will corrupt our *normal* assembly buffer.

This happens because corosync_sending_allowed() (called by cs_ipcs_msg_process()) returns QB_TRUE for IPC connections that are CS_LIB_FLOW_CONTROL_NOT_REQUIRED, such as cfg. A cfg message can therefore still pass from IPC to totem during SYNC, which makes step 6) possible.

One straightforward way to solve this issue is to change all CS_LIB_FLOW_CONTROL_NOT_REQUIRED to CS_LIB_FLOW_CONTROL_REQUIRED for the cfg service.

---------- Forwarded message ----------
From: jason <huzhijiang@xxxxxxxxx>
Date: Fri, Jan 30, 2015 at 7:29 PM
Subject: Question about message generation/origination during SYNC
To: "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>

Dear All,

By analyzing the current corosync code, I found that if some messages can be generated from the library during SYNC processing (such as reload/shutdown over cfgtool, because they are CS_LIB_FLOW_CONTROL_NOT_REQUIRED), or in other words, if they can be generated on the new_message_queue_trans queue because instance->waiting_trans_ack was set to 1, then they may have a chance to be originated after the last SYNC message.
In this situation, they will be delivered after instance->waiting_trans_ack and totempg_waiting_transack are set back to 0, so the assembly for normal messages will be used to defragment them, not the expected assembly for trans messages. This may eventually result in lost normal messages, because the fragment number does not equal the assembly's last_frag_num.

Please have a look: is this really a problem, or have I missed something?

Thank you!

--
Yours,
Jason
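
To make the scenario from the reply at the top easier to follow, here is a minimal toy model in C. It is only a sketch and not corosync code: every type and function prefixed with toy_ is hypothetical, and it assumes (as a simplification) that a message is stamped with a fragment number from whichever assembly is active at origination time, while the receiver picks its assembly from the waiting_trans_ack value at delivery time. The starting value 3 for the trans assembly is arbitrary.

```c
#include <stdbool.h>

/* Toy model only: none of these types or functions are corosync's. */
enum toy_flow_control { TOY_FLOW_NOT_REQUIRED, TOY_FLOW_REQUIRED };

struct toy_assembly {
    unsigned int last_frag_num;   /* fragment number this assembly expects */
};

struct toy_node {
    bool waiting_trans_ack;       /* true while SYNC is still in progress */
    struct toy_assembly trans;    /* used while waiting_trans_ack is set */
    struct toy_assembly normal;   /* used once SYNC is done */
};

struct toy_msg {
    unsigned int frag_num;        /* stamped at origination time */
};

/* Mimics the behaviour described in the email: requests that do not
 * require flow control pass IPC even while SYNC is in progress. */
static bool toy_sending_allowed(enum toy_flow_control fc, bool sync_in_progress)
{
    if (fc == TOY_FLOW_NOT_REQUIRED)
        return true;
    return !sync_in_progress;
}

/* The assembly in use depends on the flag at the moment of the call. */
static struct toy_assembly *toy_current_assembly(struct toy_node *n)
{
    return n->waiting_trans_ack ? &n->trans : &n->normal;
}

static struct toy_msg toy_originate(struct toy_node *n)
{
    struct toy_msg m = { toy_current_assembly(n)->last_frag_num };
    return m;
}

/* Returns true if the message matches the assembly chosen at delivery. */
static bool toy_deliver(struct toy_node *n, struct toy_msg m)
{
    return m.frag_num == toy_current_assembly(n)->last_frag_num;
}

/* Replays steps 6) to 15) on Node B for a given flow-control setting.
 * Returns true if the cfg message is consistent when it is delivered. */
static bool toy_replay(enum toy_flow_control fc)
{
    struct toy_node b = { true, { 3 }, { 0 } };   /* SYNC in progress */
    struct toy_msg cfg = { 0 };
    bool queued = false;

    if (toy_sending_allowed(fc, b.waiting_trans_ack)) {
        cfg = toy_originate(&b);                  /* step 6: trans assembly */
        queued = true;
    }
    b.waiting_trans_ack = false;                  /* step 14: SYNC done */
    if (!queued)
        cfg = toy_originate(&b);                  /* held back until after SYNC */
    return toy_deliver(&b, cfg);                  /* step 15 */
}
```

In this model, toy_replay(TOY_FLOW_NOT_REQUIRED) reports a mismatch, because the message was stamped against the trans assembly but checked against the normal one, while toy_replay(TOY_FLOW_REQUIRED) stays consistent because the message is only originated after SYNC has finished. That is the effect the proposed change for the cfg service is meant to achieve; whether the real code behaves the same way needs to be confirmed against the corosync sources.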