Re: Question about message generation/origination during SYNC

 Hi All,

I have made two patches (attached) to prove that this issue really exists.

In prove_cfg_can_send_messages_through_totem_during_sync.patch, the
functionality of ipc_allow_connections is commented out so that it is
always true, which makes the first sync procedure behave the same as
the subsequent ones. The patch then makes cpg_sync_process() always
return -1 so that sync never finishes. With this patch applied, during
corosync startup in a simple one-node configuration, one can run
corosync-cfgtool -R while sync is in progress. The result is that the
cfg command CAN pass through IPC and totem, send the reload message
out, and then perform the reload. So this patch proves that cfg
(library) can send messages through totem during sync without any
restriction.

In simualte_a_cfg_message_during_sync_which_breaks_defrag.patch (which
does not need to be based on the first one), a cfg message is
simulated (named the breaker message in the patch) and sent right
after every sync barrier message. The breaker message sent after the
last sync barrier message (the one that causes
sync_synchronization_completed()) is therefore sent during the sync
procedure but received by assembly_list_inuse/assembly_list_free, not
by the expected assembly_list_inuse_trans/assembly_list_free_trans,
and finally causes a "fragmented continuation %u is not equal to
assembly last_frag_num..." log. I also placed an assert(0) after that
log in this patch to make it easy to see. This can be reproduced by
applying the patch and simply running a single-node configuration.

I hope these two patches make this issue easy to see.

On Fri, Jun 5, 2015 at 12:05 AM, jason <huzhijiang@xxxxxxxxx> wrote:
> Please consider the possibility of this potential issue. As I
> understand it, things may happen like this in a two-node cluster
> with Node A and Node B:
>
> Node A                    Node B
>                                1) Send the last SYNC message(seq n)
>                                2) Send the token
> 3) Got token
> 4) Lost the last SYNC message(seq n)
> 5) SYNC cannot complete, must wait for retransmit
> 6) Originate a cfg message(seq n + 1) based on *trans* assembly buffer
> 7) Cannot immediately deliver cfg message (seq n + 1) based on trans
> assembly buffer
> 8) Send the token
>                                9) Got token
>                                10) retransmit last SYNC message(seq n)
>                                11) Send the token
> 12) Got token
> 13) Received the last SYNC message(seq n)
> 14) SYNC done, switch to the normal assembly buffer
> 15) Deliver cfg message(seq n + 1) based on *normal* assembly buffer! BAD!
>
> In this case, cfg message(seq n + 1) will corrupt our *normal* assembly buffer.
>
> This is because corosync_sending_allowed() (called by
> cs_ipcs_msg_process()) returns QB_TRUE for IPC connections that are
> CS_LIB_FLOW_CONTROL_NOT_REQUIRED, such as cfg. So a cfg message can
> still pass from IPC to totem during SYNC, which allows step 6) to
> happen.
>
> One straightforward way to solve this issue is to change all
> CS_LIB_FLOW_CONTROL_NOT_REQUIRED to CS_LIB_FLOW_CONTROL_REQUIRED for
> the cfg service.
>
>
> ---------- Forwarded message ----------
> From: jason <huzhijiang@xxxxxxxxx>
> Date: Fri, Jan 30, 2015 at 7:29 PM
> Subject: Question about message generation/origination during SYNC
> To: "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>
>
>
> Dear All,
>
> By analyzing the current corosync code, I found that if some messages
> can be generated from the library during SYNC processing (such as
> reload/shutdown via cfgtool, because they are
> CS_LIB_FLOW_CONTROL_NOT_REQUIRED), or in other words, if they can be
> generated on the new_message_queue_trans queue because
> instance->waiting_trans_ack was set to 1, then they may be
> originated after the last SYNC message. In this situation, they
> will be delivered after instance->waiting_trans_ack and
> totempg_waiting_transack are set back to 0, so the assembly for
> normal messages will be used to defragment those messages, not the
> expected assembly for trans messages. This may finally result in
> lost normal messages because the fragment number is not equal to the
> assembly's last_frag_num.
>
> Please have a look: is this really a problem, or have I missed something?
>
> Thank you!
>
>
>
> --
> Yours,
> Jason
>
>
> --
> Yours,
> Jason



-- 
Yours,
Jason