Re: [PATCH 23/28] votequrorum: Assert sender nodeid is known

Jan Friesse <jfriesse@xxxxxxxxxx> · Thu, 13 Jun 2013 09:11:46 +0200

Fabio M. Di Nitto napsal(a):
> On 6/13/2013 7:23 AM, Jan Friesse wrote:
>> Fabio M. Di Nitto napsal(a):
>>> On 6/12/2013 5:03 PM, Jan Friesse wrote:
>>>> Signed-off-by: Jan Friesse <jfriesse@xxxxxxxxxx>
>>>> ---
>>>>  exec/votequorum.c |    3 +++
>>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/exec/votequorum.c b/exec/votequorum.c
>>>> index b561409..131b734 100644
>>>> --- a/exec/votequorum.c
>>>> +++ b/exec/votequorum.c
>>>> @@ -1634,6 +1634,9 @@ static void message_handler_req_exec_votequorum_nodeinfo (
>>>>  
>>>>  	if (nodeid == VOTEQUORUM_QDEVICE_NODEID) {
>>>>  		struct cluster_node *sender_node = find_node_by_nodeid(sender_nodeid);
>>>> +
>>>> +		assert(sender_node != NULL);
>>>> +
>>>>  		if ((!cluster_is_quorate) &&
>>>>  		    (sender_node->flags & NODE_FLAGS_QUORATE)) {
>>>>  			node->votes = req_exec_quorum_nodeinfo->votes;
>>>>
>>>
>>> I would prefer a proper error checking rather than assert here.
>>>
>>
>> Agree. Actually, coverity found this problem, but I was ensure if it is
>> not problem (this is what I was assuming because of dereference of
>> sender_node) and then assert is proper solution OR it is problem and
>> then this patch starts discussion about proper solution.
>>
>>> The logic is that a node always send its own information (to populate
>>> node for node == sender_nodeid) and then send quorum device info from
>>> that node.
>>>
>>> See static int votequorum_sync_process (void) that´s the one triggering
>>> the population of the node entries.
>>>
>>> In theory, if sync queues the messages as expected, sender_node is never
>>> NULL. If it is, then i´ll prefer at least to see a message that says:
>>> "received qdevice info before sender node info\n" and then quit (unless
>>> we can recover from that status easily).
>>>
>>
>> Actually, because you've wrote that code you are maybe able to see how
>> to recover easily, but when I was reading that code, it really didn't
>> look so.
> 
> Ok so the question turns into:
> 
> static int votequorum_sync_process (void)
> {
>         votequorum_exec_send_nodeinfo(us->node_id); <- message 1
>         votequorum_exec_send_nodeinfo(VOTEQUORUM_QDEVICE_NODEID); <- 2
>         if (strlen(qdevice_name)) {
> 
> votequorum_exec_send_qdevice_reg(VOTEQUORUM_QDEVICE_OPERATION_REGISTER,
>                                                  qdevice_name);
>         }
>         return 0;
> }
> 
> given this sequence of events, is there a possibility that message 1 can
> be delivered after message 2?
> 

Short answer is NO. Long answer is, if there is such possibility, it's
ultra high priority bug and must be fixed.

> If there is that possibility, then we need to review the whole
> send_nodeinfo logic as there is the assumption (in more than one place
> IIRC) that they arrive in sequence.
> 
> If there is the guarantee that they are delivered in the same order,
> then we can use that assert check to verify it and error out by
> reporting deeper issues.

There is a lot of informations logged at the beginning of
message_handler_req_exec_votequorum_nodeinfo. Assert will cause sigabrt
so blackbox is created with all that logged informations. -> From my
point of view, assert is enough, because axiom 1 tells us that msg 1 is
always before msg 2 and axiom 2 is msg 1 contains informations about
send_node -> assert never appears and if it appears, it's bug in code.

> 
> Fabio

Honza
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss