Re: [PATCH 23/28] votequrorum: Assert sender nodeid is known

"Fabio M. Di Nitto" <fdinitto@xxxxxxxxxx> · Thu, 13 Jun 2013 09:14:17 +0200



On 6/13/2013 9:11 AM, Jan Friesse wrote:
> Fabio M. Di Nitto napsal(a):
>> On 6/13/2013 7:23 AM, Jan Friesse wrote:
>>> Fabio M. Di Nitto napsal(a):
>>>> On 6/12/2013 5:03 PM, Jan Friesse wrote:
>>>>> Signed-off-by: Jan Friesse <jfriesse@xxxxxxxxxx>
>>>>> ---
>>>>>  exec/votequorum.c |    3 +++
>>>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/exec/votequorum.c b/exec/votequorum.c
>>>>> index b561409..131b734 100644
>>>>> --- a/exec/votequorum.c
>>>>> +++ b/exec/votequorum.c
>>>>> @@ -1634,6 +1634,9 @@ static void message_handler_req_exec_votequorum_nodeinfo (
>>>>>  
>>>>>  	if (nodeid == VOTEQUORUM_QDEVICE_NODEID) {
>>>>>  		struct cluster_node *sender_node = find_node_by_nodeid(sender_nodeid);
>>>>> +
>>>>> +		assert(sender_node != NULL);
>>>>> +
>>>>>  		if ((!cluster_is_quorate) &&
>>>>>  		    (sender_node->flags & NODE_FLAGS_QUORATE)) {
>>>>>  			node->votes = req_exec_quorum_nodeinfo->votes;
>>>>>
>>>>
>>>> I would prefer a proper error checking rather than assert here.
>>>>
>>>
>>> Agree. Actually, coverity found this problem, but I was ensure if it is
>>> not problem (this is what I was assuming because of dereference of
>>> sender_node) and then assert is proper solution OR it is problem and
>>> then this patch starts discussion about proper solution.
>>>
>>>> The logic is that a node always send its own information (to populate
>>>> node for node == sender_nodeid) and then send quorum device info from
>>>> that node.
>>>>
>>>> See static int votequorum_sync_process (void) that´s the one triggering
>>>> the population of the node entries.
>>>>
>>>> In theory, if sync queues the messages as expected, sender_node is never
>>>> NULL. If it is, then i´ll prefer at least to see a message that says:
>>>> "received qdevice info before sender node info\n" and then quit (unless
>>>> we can recover from that status easily).
>>>>
>>>
>>> Actually, because you've wrote that code you are maybe able to see how
>>> to recover easily, but when I was reading that code, it really didn't
>>> look so.
>>
>> Ok so the question turns into:
>>
>> static int votequorum_sync_process (void)
>> {
>>         votequorum_exec_send_nodeinfo(us->node_id); <- message 1
>>         votequorum_exec_send_nodeinfo(VOTEQUORUM_QDEVICE_NODEID); <- 2
>>         if (strlen(qdevice_name)) {
>>
>> votequorum_exec_send_qdevice_reg(VOTEQUORUM_QDEVICE_OPERATION_REGISTER,
>>                                                  qdevice_name);
>>         }
>>         return 0;
>> }
>>
>> given this sequence of events, is there a possibility that message 1 can
>> be delivered after message 2?
>>
> 
> Short answer is NO. Long answer is, if there is such possibility, it's
> ultra high priority bug and must be fixed.
> 
>> If there is that possibility, then we need to review the whole
>> send_nodeinfo logic as there is the assumption (in more than one place
>> IIRC) that they arrive in sequence.
>>
>> If there is the guarantee that they are delivered in the same order,
>> then we can use that assert check to verify it and error out by
>> reporting deeper issues.
> 
> There is a lot of informations logged at the beginning of
> message_handler_req_exec_votequorum_nodeinfo. Assert will cause sigabrt
> so blackbox is created with all that logged informations. -> From my
> point of view, assert is enough, because axiom 1 tells us that msg 1 is
> always before msg 2 and axiom 2 is msg 1 contains informations about
> send_node -> assert never appears and if it appears, it's bug in code.

Ok works for me if you think the info are enough, otherwise just switch to

if (sender_node == NULL) {
  log some extra stuff
  die hard
}

Fabio

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss