Re: Question about message generation/origination during SYNC

Digimer <lists@xxxxxxxxxx> · Fri, 19 Jun 2015 22:58:13 -0400

You might want to post this to the clusterlabs mailing list. This list
is being phased out, so there probably aren't many people left here.

http://clusterlabs.org/mailman/listinfo

digimer

On 19/06/15 10:39 PM, jason wrote:
>  Hi All,
> 
> I have made two patches (attached) to prove that this issue really exists.
> 
> In prove_cfg_can_send_messages_through_totem_during_sync.patch, it
> first comments out the functionality of ipc_allow_connections so that
> the first sync procedure behaves the same as subsequent ones, that
> is, ipc_allow_connections is always true. Then it makes
> cpg_sync_process() always return -1 so that sync never stops. With
> this patch, during corosync startup in a simple one-node
> configuration, one can run corosync-cfgtool -R during sync. The result
> is that the cfg command CAN pass through IPC and totem, send the
> reload message out, and then do the reload. So this patch shows that
> cfg (library) can send messages through totem during sync without any
> restriction.
> 
> In simualte_a_cfg_message_during_sync_which_breaks_defrag.patch (which
> does not need to be based on the first one), it simulates a cfg
> message (named the breaker message in the patch) and sends it right
> after every sync barrier message. So the breaker message sent after
> the last sync barrier message (the one that causes
> sync_synchronization_completed()) is sent during the sync procedure
> but is received by assembly_list_inuse/assembly_list_free, not the
> expected assembly_list_inuse_trans/assembly_list_free_trans, and
> finally causes a "fragmented continuation %u is not equal to assembly
> last_frag_num..." log. I also placed an assert(0) after that log in
> this patch to make it easy to see. This can be reproduced by applying
> this patch and simply running a single-node configuration.
> 
> I hope these two patches can make it easy to illustrate this issue.
> 
> On Fri, Jun 5, 2015 at 12:05 AM, jason <huzhijiang@xxxxxxxxx> wrote:
>> Please consider the possibility of this potential issue. As I
>> understand it, things may happen like this in a two-node cluster
>> with Node A and Node B:
>>
>> Node A                    Node B
>>                                1) Send the last SYNC message(seq n)
>>                                2) Send the token
>> 3) Got token
>> 4) Lost the last SYNC message(seq n)
>> 5) SYNC cannot complete; need to wait for the retransmit
>> 6) Originate a cfg message(seq n + 1) based on *trans* assembly buffer
>> 7) Cannot immediately deliver cfg message (seq n + 1) based on trans
>> assembly buffer
>> 8) Send the token
>>                                9) Got token
>>                                10) retransmit last SYNC message(seq n)
>>                                11) Send the token
>> 12) Got token
>> 13) Received the last SYNC message(seq n)
>> 14) SYNC done, switch to the normal assembly buffer
>> 15) Deliver cfg message(seq n + 1) based on *normal* assembly buffer! BAD!
>>
>> In this case, cfg message(seq n + 1) will corrupt our *normal* assembly buffer.
>>
>> This is because corosync_sending_allowed() (called by
>> cs_ipcs_msg_process()) returns QB_TRUE for IPC connections that are
>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED, such as cfg. So a cfg message can
>> still pass through IPC to totem during SYNC, which allows step 6) to
>> happen.
>>
>> One straightforward way to solve this issue is to change all
>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED to CS_LIB_FLOW_CONTROL_REQUIRED for
>> the cfg service.
>>
>>
>> ---------- Forwarded message ----------
>> From: jason <huzhijiang@xxxxxxxxx>
>> Date: Fri, Jan 30, 2015 at 7:29 PM
>> Subject: Question about message generation/origination during SYNC
>> To: "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>
>>
>>
>> Dear All,
>>
>> By analyzing the current corosync code, I found that if some messages
>> can be generated from a library during SYNC processing (such as
>> reload/shutdown via cfgtool, because they are
>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED), or in other words, if they can be
>> queued on the new_message_queue_trans queue because
>> instance->waiting_trans_ack was set to 1, then they have a chance of
>> being originated after the last SYNC message. In this situation, they
>> will be delivered after instance->waiting_trans_ack and
>> totempg_waiting_transack are set back to 0, so the assembly for
>> normal messages will be used to defragment those messages, not the
>> expected assembly for trans messages. This may finally result in
>> lost normal messages because the fragment number is not equal to the
>> assembly's last_frag_num.
>>
>> Could you please check whether this is really a problem, or whether
>> I have missed something?
>>
>> Thank you!
>>
>>
>>
>> --
>> Yours,
>> Jason
>>
>>
>> --
>> Yours,
>> Jason
> 
> 
> 
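
The quoted steps 6) to 15) can be modelled with a small toy sketch in C.
This is not corosync code. Only waiting_trans_ack and last_frag_num are
names taken from the thread; the structs, the enums, the helper functions
and the starting fragment numbers (3 for trans, 0 for normal) are
hypothetical and chosen only for illustration. The model assumes a
message is stamped against the buffer that is active at origination and
checked against the buffer that is active at delivery.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the two flow-control settings named in the thread. */
enum flow_control { FLOW_CONTROL_REQUIRED, FLOW_CONTROL_NOT_REQUIRED };

/* Possible results of the modelled scenario. */
enum outcome { BLOCKED_AT_IPC, DELIVERED_OK, FRAG_MISMATCH };

/* Hypothetical reassembly context: only the field the log message mentions. */
struct assembly { unsigned last_frag_num; };

struct node {
    struct assembly normal;   /* used outside SYNC */
    struct assembly trans;    /* used while SYNC is in progress */
    bool waiting_trans_ack;   /* true while SYNC is in progress */
};

/* Model of the IPC gate: NOT_REQUIRED connections are never held back. */
static bool sending_allowed(const struct node *n, enum flow_control fc)
{
    if (fc == FLOW_CONTROL_NOT_REQUIRED)
        return true;
    return !n->waiting_trans_ack;
}

/* The buffer is chosen from the node state at the moment of the call. */
static struct assembly *active_assembly(struct node *n)
{
    return n->waiting_trans_ack ? &n->trans : &n->normal;
}

/* Delivery succeeds only if the continuation matches the active buffer. */
static bool deliver(struct node *n, unsigned continuation)
{
    return continuation == active_assembly(n)->last_frag_num;
}

/* Steps 6) to 15): originate during SYNC, deliver after SYNC is done. */
static enum outcome run_scenario(enum flow_control fc)
{
    struct node n = { .normal = { 0 }, .trans = { 3 },
                      .waiting_trans_ack = true };

    if (!sending_allowed(&n, fc))
        return BLOCKED_AT_IPC;           /* step 6) never happens */

    assert(n.waiting_trans_ack);         /* origination is during SYNC */
    unsigned continuation = active_assembly(&n)->last_frag_num; /* step 6) */

    n.waiting_trans_ack = false;         /* steps 13) and 14): SYNC done */

    return deliver(&n, continuation) ? DELIVERED_OK : FRAG_MISMATCH; /* 15) */
}
```

In this model, run_scenario(FLOW_CONTROL_NOT_REQUIRED) yields FRAG_MISMATCH,
which corresponds to the "fragmented continuation ... is not equal to
assembly last_frag_num" case, while run_scenario(FLOW_CONTROL_REQUIRED)
yields BLOCKED_AT_IPC, which corresponds to the proposed change for the
cfg service.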

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss