You might want to post this to the clusterlabs mailing list. This list
is being phased out, so there probably aren't many people left here.

http://clusterlabs.org/mailman/listinfo

digimer

On 19/06/15 10:39 PM, jason wrote:
> Hi All,
>
> I have made two patches (attached) to prove that this issue really
> exists.
>
> The first patch,
> prove_cfg_can_send_messages_through_totem_during_sync.patch, comments
> out the functionality of ipc_allow_connections so that the first sync
> procedure behaves like the subsequent ones; in effect,
> ipc_allow_connections is always true. It then makes cpg_sync_process()
> always return -1 so that sync never completes. With this patch, during
> corosync startup in a simple one-node configuration, one can run
> corosync-cfgtool -R during sync. The result is that the cfg command
> CAN pass through IPC and totem, send the reload message out, and then
> perform the reload. So this patch proves that cfg (library) can send
> messages through totem during sync without any restriction.
>
> The second patch,
> simualte_a_cfg_message_during_sync_which_breaks_defrag.patch (which
> does not need to be based on the first one), simulates a cfg message
> (named the breaker message in the patch) and sends it right after
> every sync barrier message. The breaker message sent after the last
> sync barrier message (the one that causes
> sync_synchronization_completed()) is therefore sent during the sync
> procedure but received by assembly_list_inuse/assembly_list_free, not
> the expected assembly_list_inuse_trans/assembly_list_free_trans, and
> finally causes a "fragmented continuation %u is not equal to assembly
> last_frag_num..." log. I also placed an assert(0) after that log in
> this patch to make it easy to see. This can be reproduced by applying
> this patch and simply running a single-node configuration.
>
> I hope these two patches make it easy to illustrate this issue.
>
> On Fri, Jun 5, 2015 at 12:05 AM, jason <huzhijiang@xxxxxxxxx> wrote:
>> Please consider the possibility of this potential issue. As I
>> understand it, things may happen like this in a two-node cluster
>> with nodes A and B:
>>
>> Node A                                   Node B
>> 1) Send the last SYNC message (seq n)
>> 2) Send the token
>>                                          3) Got token
>>                                          4) Lost the last SYNC message
>>                                             (seq n)
>>                                          5) SYNC cannot be done, need to
>>                                             wait for retransmit
>>                                          6) Originate a cfg message
>>                                             (seq n + 1) based on *trans*
>>                                             assembly buffer
>>                                          7) Cannot immediately deliver
>>                                             cfg message (seq n + 1)
>>                                             based on trans assembly
>>                                             buffer
>>                                          8) Send the token
>> 9) Got token
>> 10) Retransmit last SYNC message (seq n)
>> 11) Send the token
>>                                          12) Got token
>>                                          13) Received the last SYNC
>>                                              message (seq n)
>>                                          14) SYNC done, switch to the
>>                                              normal assembly buffer
>>                                          15) Deliver cfg message
>>                                              (seq n + 1) based on
>>                                              *normal* assembly buffer!
>>                                              BAD!
>>
>> In this case, cfg message (seq n + 1) will corrupt our *normal*
>> assembly buffer.
>>
>> This is because corosync_sending_allowed() (called by
>> cs_ipcs_msg_process()) returns QB_TRUE for IPC connections that are
>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED, such as cfg. So a cfg message can
>> still pass from IPC to totem during SYNC, which allows step 6) to
>> happen.
>>
>> One straightforward way to solve this issue is to change all
>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED to CS_LIB_FLOW_CONTROL_REQUIRED for
>> the cfg service.
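
The gate described in the June 5 message can be sketched roughly as
below. This is only a toy model, not corosync's code: the enum and
function names are hypothetical, quorum and message reservation are
ignored, and it assumes (as the proposed fix implies) that requests
which do require flow control are held back while sync is running.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two flow-control classes named in the
 * thread. These are NOT corosync's own definitions. */
enum flow_control {
    FLOW_CONTROL_REQUIRED,
    FLOW_CONTROL_NOT_REQUIRED
};

/* Simplified gate: a request that does not require flow control is
 * always let through, even while sync is in progress. A request that
 * requires flow control is assumed to wait until sync has finished. */
bool sending_allowed(enum flow_control fc, bool sync_in_progress)
{
    if (fc == FLOW_CONTROL_NOT_REQUIRED) {
        return true;
    }
    return !sync_in_progress;
}

/* Usage example: during sync only the NOT_REQUIRED class gets through,
 * which is what makes step 6) of the timeline possible. */
void demo_gate(void)
{
    assert(sending_allowed(FLOW_CONTROL_NOT_REQUIRED, true));
    assert(!sending_allowed(FLOW_CONTROL_REQUIRED, true));
    assert(sending_allowed(FLOW_CONTROL_REQUIRED, false));
}
```

Under these assumptions, moving cfg to the REQUIRED class would keep
its requests out of totem until sync completes, which is the effect
the suggested change is aiming for.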
>>
>>
>> ---------- Forwarded message ----------
>> From: jason <huzhijiang@xxxxxxxxx>
>> Date: Fri, Jan 30, 2015 at 7:29 PM
>> Subject: Question about message generation/origination during SYNC
>> To: "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>
>>
>>
>> Dear All,
>>
>> By analyzing the current corosync code, I found that if some messages
>> can be generated from the library during SYNC processing (such as
>> reload/shutdown via cfgtool, because they are
>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED), or in other words, if they can be
>> queued on new_message_queue_trans because
>> instance->waiting_trans_ack is set to 1, then they may be originated
>> after the last SYNC message. In this situation, they will be
>> delivered after instance->waiting_trans_ack and
>> totempg_waiting_transack are set back to 0, so the assembly for
>> normal messages will be used to defragment those messages, not the
>> expected assembly for trans messages. This may finally result in
>> lost normal messages because the fragment number is not equal to the
>> assembly last_frag_num.
>>
>> Could you please check whether this is really a problem, or whether
>> I have missed something?
>>
>> Thank you!
>>
>>
>>
>> --
>> Yours,
>> Jason
>>
>>
>> --
>> Yours,
>> Jason
>
>
>

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
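
The assembly mismatch described in the original January message can
be illustrated with a much-reduced model. Everything here is
hypothetical: the struct and function names are made up, the fragment
numbers are arbitrary, and the real code keeps sender-side
fragmentation state and receiver-side assembly lists in separate
structures. The only point is that the assembly is picked once when
the message is originated and again when it is delivered, and the two
choices can disagree if sync ends in between.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names are corosync's. */
struct assembly {
    unsigned last_frag_num;  /* number the next continuation must match */
};

struct node {
    struct assembly normal;  /* used outside of sync */
    struct assembly trans;   /* used while waiting for the trans ack */
    bool waiting_trans_ack;
};

struct msg {
    unsigned continuation;   /* taken from the assembly active at origination */
};

/* Origination: build the message against whichever assembly is active now. */
struct msg originate(const struct node *n)
{
    const struct assembly *a = n->waiting_trans_ack ? &n->trans : &n->normal;
    struct msg m = { a->last_frag_num };
    return m;
}

/* Delivery: the assembly is chosen again, from the state at delivery time.
 * Returns false when the continuation does not match, i.e. the case that
 * would produce the "fragmented continuation ..." log. */
bool deliver(const struct node *n, struct msg m)
{
    const struct assembly *a = n->waiting_trans_ack ? &n->trans : &n->normal;
    return m.continuation == a->last_frag_num;
}

/* Replays the sequence: originate during sync, optionally finish sync,
 * then deliver. The values 0 and 3 are arbitrary illustration values. */
bool replay(bool sync_ends_before_delivery)
{
    struct node n = { .normal = { 0 }, .trans = { 3 }, .waiting_trans_ack = true };
    struct msg m = originate(&n);

    if (sync_ends_before_delivery) {
        n.waiting_trans_ack = false;
    }
    return deliver(&n, m);
}
```

In this model replay(false) accepts the message, while replay(true)
rejects it because the message was built against the trans assembly
but checked against the normal one.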