Thanks for the reminder, I will forward it to that list.

On Sat, Jun 20, 2015 at 10:58 AM, Digimer <lists@xxxxxxxxxx> wrote:
> You might want to post this to the clusterlabs mailing list. This list
> is being phased out, so there probably aren't many people left here.
>
> http://clusterlabs.org/mailman/listinfo
>
> digimer
>
> On 19/06/15 10:39 PM, jason wrote:
>> Hi All,
>>
>> I have made two patches (attached) to prove that this issue really
>> exists.
>>
>> In prove_cfg_can_send_messages_through_totem_during_sync.patch, it
>> first comments out the functionality of ipc_allow_connections to make
>> the first sync procedure the same as the subsequent ones, i.e. IPC
>> connections are always allowed. It then makes cpg_sync_process()
>> always return -1 so that sync never finishes. With this patch, during
>> corosync startup in a simple one-node configuration, one can run
>> corosync-cfgtool -R while sync is in progress. The result is that the
>> cfg command CAN pass through IPC and totem, send the reload message
>> out, and then perform the reload. So this patch proves that cfg (the
>> library) can send messages through totem during sync without any
>> restriction.
>>
>> In simualte_a_cfg_message_during_sync_which_breaks_defrag.patch (it
>> does not need to be applied on top of the first one), it simulates a
>> cfg message (named the breaker message in the patch) and sends it
>> right after every sync barrier message. So the breaker message sent
>> after the last sync barrier message (the one that causes
>> sync_synchronization_completed()) is sent during the sync procedure
>> but received by assembly_list_inuse/assembly_list_free, not the
>> expected assembly_list_inuse_trans/assembly_list_free_trans, and
>> finally causes a "fragmented continuation %u is not equal to assembly
>> last_frag_num..." log. I also placed an assert(0) after that log in
>> this patch to make it easy to see. This can be reproduced by applying
>> the patch and simply running a single-node configuration.
>>
>> I hope these two patches make it easy to illustrate this issue.
>>
>> On Fri, Jun 5, 2015 at 12:05 AM, jason <huzhijiang@xxxxxxxxx> wrote:
>>> Please consider the possibility of this potential issue. As I
>>> understand it, things may happen like this in a two-node cluster
>>> with nodes A and B:
>>>
>>> 1)  Node A: sends the last SYNC message (seq n)
>>> 2)  Node A: sends the token
>>> 3)  Node B: gets the token
>>> 4)  Node B: has lost the last SYNC message (seq n)
>>> 5)  Node B: SYNC cannot complete, must wait for the retransmit
>>> 6)  Node B: originates a cfg message (seq n+1) based on the *trans* assembly buffer
>>> 7)  Node B: cannot immediately deliver the cfg message (seq n+1) based on the trans assembly buffer
>>> 8)  Node B: sends the token
>>> 9)  Node A: gets the token
>>> 10) Node A: retransmits the last SYNC message (seq n)
>>> 11) Node A: sends the token
>>> 12) Node B: gets the token
>>> 13) Node B: receives the last SYNC message (seq n)
>>> 14) Node B: SYNC done, switches to the normal assembly buffer
>>> 15) Node B: delivers the cfg message (seq n+1) based on the *normal* assembly buffer! BAD!
>>>
>>> In this case, the cfg message (seq n+1) will corrupt our *normal*
>>> assembly buffer.
>>>
>>> This happens because corosync_sending_allowed() (called by
>>> cs_ipcs_msg_process()) returns QB_TRUE for IPC connections that are
>>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED, such as cfg. So a cfg message can
>>> still pass from IPC to totem during SYNC, which is what allows step
>>> 6) to happen.
>>>
>>> One straightforward way to solve this issue is to change all
>>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED entries to
>>> CS_LIB_FLOW_CONTROL_REQUIRED for the cfg service.
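
To make the gating rule concrete, here is a minimal, self-contained toy
model. It is a paraphrase, not the actual corosync_sending_allowed() code;
the sync_in_process and reserved_msgs parameters are simplified stand-ins
for the real bookkeeping. It only shows why a
CS_LIB_FLOW_CONTROL_NOT_REQUIRED service such as cfg bypasses the sync gate
today, and why flipping cfg to CS_LIB_FLOW_CONTROL_REQUIRED, as suggested
above, would hold its messages back until sync completes:

    /*
     * Toy model only -- a paraphrase of the gating rule described above,
     * not the literal corosync_sending_allowed() from the corosync tree.
     */
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    enum flow_control {
            FLOW_CONTROL_REQUIRED,
            FLOW_CONTROL_NOT_REQUIRED
    };

    static bool sending_allowed(enum flow_control fc, bool sync_in_process,
                                int reserved_msgs)
    {
            if (fc == FLOW_CONTROL_NOT_REQUIRED) {
                    /* cfg today: flow control is not required, so the sync
                     * state is never consulted and the message goes through. */
                    return true;
            }
            /* Flow-controlled services are held back while sync is running
             * (and need a reserved slot in the queue). */
            return reserved_msgs > 0 && !sync_in_process;
    }

    int main(void)
    {
            /* As things stand, a cfg request is let through even mid-sync... */
            assert(sending_allowed(FLOW_CONTROL_NOT_REQUIRED, true, 0) == true);

            /* ...whereas with the proposed CS_LIB_FLOW_CONTROL_REQUIRED
             * setting the same request would be deferred until sync ends. */
            assert(sending_allowed(FLOW_CONTROL_REQUIRED, true, 1) == false);
            assert(sending_allowed(FLOW_CONTROL_REQUIRED, false, 1) == true);

            printf("NOT_REQUIRED bypasses the sync gate; REQUIRED does not\n");
            return 0;
    }

The actual fix would simply flip the flow_control flags in the cfg service's
handler definitions, as suggested above; the model only demonstrates the
before/after behaviour of the gate.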
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: jason <huzhijiang@xxxxxxxxx>
>>> Date: Fri, Jan 30, 2015 at 7:29 PM
>>> Subject: Question about message generation/origination during SYNC
>>> To: "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>
>>>
>>>
>>> Dear All,
>>>
>>> By analyzing the current corosync code, I found that if some messages
>>> can be generated from the library during SYNC processing (such as
>>> reload/shutdown via cfgtool, because they are
>>> CS_LIB_FLOW_CONTROL_NOT_REQUIRED), or in other words, if they can be
>>> queued on new_message_queue_trans because instance->waiting_trans_ack
>>> was set to 1, then they may be originated after the last SYNC message.
>>> In that situation they will be delivered after
>>> instance->waiting_trans_ack and totempg_waiting_transack have been set
>>> back to 0, so the assembly for normal messages will be used to
>>> defragment them instead of the expected assembly for trans messages.
>>> This may finally result in lost normal messages, because the fragment
>>> number will not equal the assembly's last_frag_num.
>>>
>>> Please have a look at whether this is really a problem, or whether I
>>> have missed something.
>>>
>>> Thank you!
>>>
>>>
>>>
>>> --
>>> Yours,
>>> Jason
>>>
>>>
>>> --
>>> Yours,
>>> Jason
>>
>>
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?

--
Yours,
Jason
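
For illustration, below is a minimal, self-contained toy model of the
defragmentation hazard described in the forwarded message: the assembly
used to reassemble fragments is selected by the waiting-trans-ack flag at
delivery time, so a fragment stream that straddles the moment the flag
drops back to 0 is checked against the wrong assembly and fails the
last_frag_num continuity test. The struct, function names and fragment
numbering here are simplified stand-ins, not the actual totempg.c code.

    /*
     * Toy model only -- not corosync code.  It mimics how the assembly that
     * defragments an incoming fragment is picked based on the value of the
     * waiting-trans-ack flag *at delivery time*, which is the root of the
     * hazard described above.
     */
    #include <assert.h>
    #include <stdio.h>

    struct toy_assembly {
            unsigned int last_frag_num;     /* last fragment reassembled */
    };

    static struct toy_assembly normal_asm = { .last_frag_num = 0 };
    static struct toy_assembly trans_asm  = { .last_frag_num = 0 };

    static int waiting_trans_ack = 1;       /* sync still pending */

    /* Delivery side: the assembly is chosen by the flag as it is *now*,
     * regardless of its value when the message was originated. */
    static struct toy_assembly *pick_assembly(void)
    {
            return waiting_trans_ack ? &trans_asm : &normal_asm;
    }

    static void deliver_fragment(unsigned int frag_num, int is_continuation)
    {
            struct toy_assembly *a = pick_assembly();

            if (is_continuation && frag_num != a->last_frag_num + 1) {
                    /* Corresponds to the "fragmented continuation %u is not
                     * equal to assembly last_frag_num" situation: the data
                     * in this assembly is discarded and messages are lost. */
                    printf("continuation %u does not follow last_frag_num %u -> dropped\n",
                           frag_num, a->last_frag_num);
                    return;
            }
            a->last_frag_num = frag_num;
    }

    int main(void)
    {
            /* First fragment of a large message arrives while sync is still
             * pending: it is reassembled in the trans assembly. */
            deliver_fragment(1, 0);
            assert(trans_asm.last_frag_num == 1);

            /* Sync completes before the continuation arrives; the flag flips. */
            waiting_trans_ack = 0;

            /* The continuation is now matched against the *normal* assembly,
             * whose last_frag_num is still 0, so the continuity check fails
             * and the message is lost. */
            deliver_fragment(2, 1);
            assert(normal_asm.last_frag_num == 0);

            return 0;
    }

Compiled and run, the toy prints the "does not follow last_frag_num" line
for the continuation fragment, which is the same symptom as the real log
message quoted earlier in the thread.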