Thanks for the reminder, I will forward it to that list.

On Sat, Jun 20, 2015 at 10:58 AM, Digimer <lists@xxxxxxxxxx> wrote:
> You might want to post this to the clusterlabs mailing list. This list
> is being phased out, so there probably aren't many people left here.
>
> http://clusterlabs.org/mailman/listinfo
>
> digimer
>
> On 19/06/15 10:39 PM, jason wrote:
>> Hi All,
>>
>> I have made two patches (attached) to prove that this issue really
>> exists.
>>
>> In prove_cfg_can_send_messages_through_totem_during_sync.patch, it
>> first comments out the functionality of ipc_allow_connections to make
>> the first sync procedure the same as the subsequent ones, i.e. IPC
>> connections are always allowed. It then makes cpg_sync_process()
>> always return -1 so that sync never finishes. With this patch, during
>> corosync startup in a simple one-node configuration, one can run
>> corosync-cfgtool -R while sync is in progress. The result is that the
>> cfg command CAN pass through IPC and totem, send the reload message
>> out, and then perform the reload. So this patch proves that cfg (the
>> library) can send messages through totem during sync without any
>> restriction.
>>
>> In simualte_a_cfg_message_during_sync_which_breaks_defrag.patch (it
>> does not need to be applied on top of the first one), it simulates a
>> cfg message (named the breaker message in the patch) and sends it
>> right after every sync barrier message. So the breaker message sent
>> after the last sync barrier message (the one that causes
>> sync_synchronization_completed()) is sent during the sync procedure
>> but received by assembly_list_inuse/assembly_list_free, not the
>> expected assembly_list_inuse_trans/assembly_list_free_trans, and
>> finally causes a "fragmented continuation %u is not equal to assembly
>> last_frag_num..." log. I also placed an assert(0) after that log in
>> this patch to make it easy to see. This can be reproduced by applying
>> the patch and simply running a single-node configuration.
>>
>> I hope these two patches make it easy to illustrate this issue.
>>
>> On Fri, Jun 5, 2015 at 12:05 AM, jason <huzhijiang@xxxxxxxxx> wrote:
>>> Please consider the possibility of this potential issue. As I
>>> understand it, things may happen like this in a two-node cluster
>>> with nodes A and B:
>>>
>>> 1)  Node A: sends the last SYNC message (seq n)
>>> 2)  Node A: sends the token
>>> 3)  Node B: gets the token
>>> 4)  Node B: has lost the last SYNC message (seq n)
>>> 5)  Node B: SYNC cannot complete, must wait for the retransmit
>>> 6)  Node B: originates a cfg message (seq n+1) based on the *trans* assembly buffer
>>> 7)  Node B: cannot immediately deliver the cfg message (seq n+1) based on the trans assembly buffer
>>> 8)  Node B: sends the token
>>> 9)  Node A: gets the token
>>> 10) Node A: retransmits the last SYNC message (seq n)
>>> 11) Node A: sends the token
>>> 12) Node B: gets the token
>>> 13) Node B: receives the last SYNC message (seq n)
>>> 14) Node B: SYNC done, switches to the normal assembly buffer
>>> 15) Node B: delivers the cfg message (seq n+1) based on the *normal* assembly buffer! BAD!
>>>
>>> In this case, the cfg message (seq n+1) will corrupt our *normal*
>>> assembly buffer.
>>>
>>> This happens because corosync_sending_allowed() (called by
>>> cs_ipcs_msg_process()) returns QB_TRUE for IPC connections that are
>>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED, such as cfg. So a cfg message can
>>> still pass from IPC to totem during SYNC, which is what allows step
>>> 6) to happen.
>>>
>>> One straightforward way to solve this issue is to change all
>>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED entries to
>>> CS_LIB_FLOW_CONTROL_REQUIRED for the cfg service.
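
To make the gating rule concrete, here is a minimal, self-contained toy
model. It is a paraphrase, not the actual corosync_sending_allowed() code;
the sync_in_process and reserved_msgs parameters are simplified stand-ins
for the real bookkeeping. It only shows why a
CS_LIB_FLOW_CONTROL_NOT_REQUIRED service such as cfg bypasses the sync gate
today, and why flipping cfg to CS_LIB_FLOW_CONTROL_REQUIRED, as suggested
above, would hold its messages back until sync completes:

    /*
     * Toy model only -- a paraphrase of the gating rule described above,
     * not the literal corosync_sending_allowed() from the corosync tree.
     */
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    enum flow_control {
            FLOW_CONTROL_REQUIRED,
            FLOW_CONTROL_NOT_REQUIRED
    };

    static bool sending_allowed(enum flow_control fc, bool sync_in_process,
                                int reserved_msgs)
    {
            if (fc == FLOW_CONTROL_NOT_REQUIRED) {
                    /* cfg today: flow control is not required, so the sync
                     * state is never consulted and the message goes through. */
                    return true;
            }
            /* Flow-controlled services are held back while sync is running
             * (and need a reserved slot in the queue). */
            return reserved_msgs > 0 && !sync_in_process;
    }

    int main(void)
    {
            /* As things stand, a cfg request is let through even mid-sync... */
            assert(sending_allowed(FLOW_CONTROL_NOT_REQUIRED, true, 0) == true);

            /* ...whereas with the proposed CS_LIB_FLOW_CONTROL_REQUIRED
             * setting the same request would be deferred until sync ends. */
            assert(sending_allowed(FLOW_CONTROL_REQUIRED, true, 1) == false);
            assert(sending_allowed(FLOW_CONTROL_REQUIRED, false, 1) == true);

            printf("NOT_REQUIRED bypasses the sync gate; REQUIRED does not\n");
            return 0;
    }

The actual fix would simply flip the flow_control flags in the cfg service's
handler definitions, as suggested above; the model only demonstrates the
before/after behaviour of the gate.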
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: jason <huzhijiang@xxxxxxxxx>
>>> Date: Fri, Jan 30, 2015 at 7:29 PM
>>> Subject: Question about message generation/origination during SYNC
>>> To: "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>
>>>
>>>
>>> Dear All,
>>>
>>> By analyzing the current corosync code, I found that if some messages
>>> can be generated from the library during SYNC processing (such as
>>> reload/shutdown via cfgtool, because they are
>>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED), or in other words, if they can be
>>> queued on new_message_queue_trans because instance->waiting_trans_ack
>>> was set to 1, then they may be originated after the last SYNC message.
>>> In that situation they will be delivered after
>>> instance->waiting_trans_ack and totempg_waiting_transack have been set
>>> back to 0, so the assembly for normal messages will be used to
>>> defragment them instead of the expected assembly for trans messages.
>>> This may finally result in lost normal messages, because the fragment
>>> number will not equal the assembly's last_frag_num.
>>>
>>> Please have a look at whether this is really a problem, or whether I
>>> have missed something.
>>>
>>> Thank you!
>>>
>>>
>>>
>>> --
>>> Yours,
>>> Jason
>>>
>>>
>>> --
>>> Yours,
>>> Jason
>>
>>
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?

--
Yours,
Jason
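
For illustration, below is a minimal, self-contained toy model of the
defragmentation hazard described in the forwarded message: the assembly
used to reassemble fragments is selected by the waiting-trans-ack flag at
delivery time, so a fragment stream that straddles the moment the flag
drops back to 0 is checked against the wrong assembly and fails the
last_frag_num continuity test. The struct, function names and fragment
numbering here are simplified stand-ins, not the actual totempg.c code.

    /*
     * Toy model only -- not corosync code.  It mimics how the assembly that
     * defragments an incoming fragment is picked based on the value of the
     * waiting-trans-ack flag *at delivery time*, which is the root of the
     * hazard described above.
     */
    #include <assert.h>
    #include <stdio.h>

    struct toy_assembly {
            unsigned int last_frag_num;     /* last fragment reassembled */
    };

    static struct toy_assembly normal_asm = { .last_frag_num = 0 };
    static struct toy_assembly trans_asm  = { .last_frag_num = 0 };

    static int waiting_trans_ack = 1;       /* sync still pending */

    /* Delivery side: the assembly is chosen by the flag as it is *now*,
     * regardless of its value when the message was originated. */
    static struct toy_assembly *pick_assembly(void)
    {
            return waiting_trans_ack ? &trans_asm : &normal_asm;
    }

    static void deliver_fragment(unsigned int frag_num, int is_continuation)
    {
            struct toy_assembly *a = pick_assembly();

            if (is_continuation && frag_num != a->last_frag_num + 1) {
                    /* Corresponds to the "fragmented continuation %u is not
                     * equal to assembly last_frag_num" situation: the data
                     * in this assembly is discarded and messages are lost. */
                    printf("continuation %u does not follow last_frag_num %u -> dropped\n",
                           frag_num, a->last_frag_num);
                    return;
            }
            a->last_frag_num = frag_num;
    }

    int main(void)
    {
            /* First fragment of a large message arrives while sync is still
             * pending: it is reassembled in the trans assembly. */
            deliver_fragment(1, 0);
            assert(trans_asm.last_frag_num == 1);

            /* Sync completes before the continuation arrives; the flag flips. */
            waiting_trans_ack = 0;

            /* The continuation is now matched against the *normal* assembly,
             * whose last_frag_num is still 0, so the continuity check fails
             * and the message is lost. */
            deliver_fragment(2, 1);
            assert(normal_asm.last_frag_num == 0);

            return 0;
    }

Compiled and run, the toy prints the "does not follow last_frag_num" line
for the continuation fragment, which is the same symptom as the real log
message quoted earlier in the thread.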