Re: Question about message generation/origination during SYNC

Jason,
thanks for the proof.

I know the problem exists. The thing is, some calls don't send any totem messages, and these may be marked as CS_LIB_FLOW_CONTROL_NOT_REQUIRED. Any function that does send a totem message has to be marked as CS_LIB_FLOW_CONTROL_REQUIRED. There is one exception in votequorum, which needs to send a totem message during sync; in that case the order of messages is not guaranteed (this is explained in the man page).

So if you want to go over IPC calls and mark them CS_LIB_FLOW_CONTROL_REQUIRED where required, I'm more than happy to ACK your patch.
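To make the audit concrete, here is a minimal sketch in C. It is not corosync code: struct ipc_call, its sends_totem_msg field and find_mismarked() are hypothetical names made up for illustration, and the enum is a local stand-in that only reuses the two flag names from this thread. The sends_totem_msg value would have to be filled in by hand after reading each handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the two markings discussed in this thread. */
enum flow_control {
    CS_LIB_FLOW_CONTROL_REQUIRED,
    CS_LIB_FLOW_CONTROL_NOT_REQUIRED
};

/* Hypothetical audit record: one row per IPC call. */
struct ipc_call {
    const char *name;
    bool sends_totem_msg;           /* determined by reading the handler */
    enum flow_control flow_control; /* how the call is currently marked */
};

/*
 * Counts the calls that send a totem message but are marked
 * CS_LIB_FLOW_CONTROL_NOT_REQUIRED. The indexes of the first out_max
 * offenders are stored in out.
 */
size_t find_mismarked(const struct ipc_call *calls, size_t count,
                      size_t *out, size_t out_max)
{
    size_t found = 0;

    for (size_t i = 0; i < count; i++) {
        if (calls[i].sends_totem_msg &&
            calls[i].flow_control == CS_LIB_FLOW_CONTROL_NOT_REQUIRED) {
            if (found < out_max) {
                out[found] = i;
            }
            found++;
        }
    }
    return found;
}
```

Every index reported by such a check is a call whose marking would need to change to CS_LIB_FLOW_CONTROL_REQUIRED.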

Regards,
  Honza

jason wrote:
  Hi All,

I have made two patches (attached) to prove that this issue really exists.

In prove_cfg_can_send_messages_through_totem_during_sync.patch, it
first comments out the ipc_allow_connections logic so that the first
sync procedure behaves the same as the subsequent ones; in effect,
ipc_allow_connections is always true. It then makes cpg_sync_process()
always return -1 so that sync never finishes. With this patch applied,
during corosync startup in a simple one-node configuration, one can
run corosync-cfgtool -R while sync is in progress. The result is that
the cfg command CAN pass through IPC and totem, send the reload message
out, and then do the reload. So this patch proves that cfg (the
library) can send messages through totem during sync without any
restriction.

In simualte_a_cfg_message_during_sync_which_breaks_defrag.patch (which
does not need to be based on the first one), it simulates a cfg
message (named the breaker message in the patch) and sends it right
after every sync barrier message. The breaker message sent after the
last sync barrier message (the one that causes
sync_synchronization_completed()) is therefore sent during the sync
procedure but received by assembly_list_inuse/assembly_list_free, not
the expected assembly_list_inuse_trans/assembly_list_free_trans, and
finally causes a "fragmented continuation %u is not equal to assembly
last_frag_num..." log message. I also placed an assert(0) after that
log in this patch to make it easy to see. This can be reproduced by
applying this patch and simply running a single-node configuration.

I hope these two patches can make it easy to illustrate this issue.

On Fri, Jun 5, 2015 at 12:05 AM, jason <huzhijiang@xxxxxxxxx> wrote:
Please consider the possibility of this potential issue. As I
understand it, things may happen like this in a two-node cluster
with Node A and Node B:

Node A                    Node B
                                1) Send the last SYNC message (seq n)
                                2) Send the token
3) Got token
4) Lost the last SYNC message (seq n)
5) SYNC cannot complete; must wait for retransmission
6) Originate a cfg message (seq n + 1) based on the *trans* assembly buffer
7) Cannot immediately deliver the cfg message (seq n + 1) based on the
   trans assembly buffer, because seq n is still missing
8) Send the token
                                9) Got token
                                10) Retransmit the last SYNC message (seq n)
                                11) Send the token
12) Got token
13) Received the last SYNC message (seq n)
14) SYNC done, switch to the normal assembly buffer
15) Deliver the cfg message (seq n + 1) based on the *normal* assembly buffer! BAD!

In this case, the cfg message (seq n + 1) will corrupt our *normal* assembly buffer.
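The race in steps 6), 14) and 15) can be reduced to a small model. The sketch below is an illustration in C under my own assumptions, not the totempg code: struct node, struct assembly, originate(), deliver() and replay_race() are hypothetical names, and the fragment numbers 0 and 3 are arbitrary values chosen only so that the two buffers differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, stripped-down reassembly state: only the field that
 * the log message quoted in this thread refers to. */
struct assembly {
    unsigned int last_frag_num;
};

struct node {
    struct assembly normal;   /* used outside SYNC */
    struct assembly trans;    /* used while waiting for the trans ack */
    bool waiting_trans_ack;   /* true during SYNC */
};

struct message {
    unsigned int continuation; /* fragment number the sender continues from */
};

static struct assembly *current_assembly(struct node *n)
{
    return n->waiting_trans_ack ? &n->trans : &n->normal;
}

/* Step 6: the message is built against whichever buffer is active now. */
struct message originate(struct node *n)
{
    struct message m = { .continuation = current_assembly(n)->last_frag_num };
    return m;
}

/* Step 15: the check uses whichever buffer is active at delivery time.
 * Returns true if the continuation matches that buffer. */
bool deliver(struct node *n, const struct message *m)
{
    return m->continuation == current_assembly(n)->last_frag_num;
}

/* Replays steps 6, 14 and 15 on Node A; returns true if delivery matched. */
bool replay_race(void)
{
    struct node a = {
        .normal = { .last_frag_num = 0 },  /* arbitrary example value */
        .trans = { .last_frag_num = 3 },   /* arbitrary example value */
        .waiting_trans_ack = true,
    };
    struct message cfg = originate(&a);    /* step 6: built on trans */

    a.waiting_trans_ack = false;           /* step 14: SYNC done */
    return deliver(&a, &cfg);              /* step 15: checked on normal */
}
```

In this model replay_race() returns false: the message was built against the trans buffer but is checked against the normal one, which is the mismatch described above.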

This happens because corosync_sending_allowed() (called by
cs_ipcs_msg_process()) returns QB_TRUE for IPC requests that are
marked CS_LIB_FLOW_CONTROL_NOT_REQUIRED, such as those of cfg. A cfg
message can therefore still pass from IPC to totem during SYNC, which
allows step 6) to happen.
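A simplified model of that gate, as I read it, could look like the following. This is a sketch under assumptions, not the body of corosync_sending_allowed(): struct gate_state, its fields and model_sending_allowed() are hypothetical, the enum is a local stand-in that reuses the two flag names, and checks that do not matter here are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the two markings discussed in this thread. */
enum flow_control {
    CS_LIB_FLOW_CONTROL_REQUIRED,
    CS_LIB_FLOW_CONTROL_NOT_REQUIRED
};

/* Hypothetical, simplified view of the state the gate looks at. */
struct gate_state {
    bool sync_in_progress;   /* true while SYNC is running */
    bool buffer_space_free;  /* true if the message could be queued */
};

/*
 * Simplified model of the decision: a request marked NOT_REQUIRED skips
 * every check, while a REQUIRED one is held back during SYNC or when
 * there is no room to queue it.
 */
bool model_sending_allowed(enum flow_control fc, const struct gate_state *st)
{
    if (fc == CS_LIB_FLOW_CONTROL_NOT_REQUIRED) {
        return true;
    }
    return st->buffer_space_free && !st->sync_in_progress;
}
```

With sync_in_progress set, the model lets a NOT_REQUIRED request through and refuses a REQUIRED one, which is why the marking of the cfg calls decides whether step 6) can occur.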

One straightforward way to solve this issue is to change all
CS_LIB_FLOW_CONTROL_NOT_REQUIRED to CS_LIB_FLOW_CONTROL_REQUIRED for
the cfg service.


---------- Forwarded message ----------
From: jason <huzhijiang@xxxxxxxxx>
Date: Fri, Jan 30, 2015 at 7:29 PM
Subject: Question about message generation/origination during SYNC
To: "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>


Dear All,

By analyzing the current corosync code, I found that some messages can
be generated from the library during SYNC processing (such as
reload/shutdown via cfgtool, because they are
CS_LIB_FLOW_CONTROL_NOT_REQUIRED). In other words, they can be
queued on the new_message_queue_trans queue because
instance->waiting_trans_ack is set to 1, and then they have a chance
of being originated after the last SYNC message. In this situation,
they will be delivered after instance->waiting_trans_ack and
totempg_waiting_transack have been set back to 0, so the assembly for
normal messages will be used to defragment those messages instead of
the expected assembly for trans messages. This may finally result in
lost normal messages because the fragment number is not equal to the
assembly's last_frag_num.

Please have a look: is this really a problem, or have I missed something?

Thank you!



--
Yours,
Jason






_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


