Re: failing assert(addrlen) in totemip_equal()

Hi Honza,

I'll take a look when I am back from holidays next week.

Fabio

On 01/02/2013 11:22 AM, Jan Friesse wrote:
> Jon,
> thanks for digging into the problem. Generally, it's a bad idea to use
> different authkeys (that's why we log so many decrypt errors). Corosync
> itself does NOT handle byzantine faults (which is what is happening here),
> and that is why the problem appears. On the other hand, we do use a digest
> to prevent some kinds of byzantine failures (and the one you are seeing
> should be that kind of failure).
> 
> I believe the main problem is that we are using HMAC, BUT without ever
> initializing the HMAC key (so you will not get a HASH fault, but rather a
> DECRYPT fault, and it CAN happen that the data are decrypted (and
> interpreted) as total garbage).
> 
> So the main question is (fabbione): shouldn't we use HMAC with an
> initialized key (in init_nss_hash, hash_param.data points to the key, and
> hash_param.len to key_len), as stated in the HMAC section of
> http://www.mozilla.org/projects/security/pki/nss/tech-notes/tn5.html?
> I believe this would solve the problem Jon is hitting.
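
For reference, the tn5 recipe boils down to importing the authkey as a
symmetric key and creating the digest context from it. A minimal sketch in C
(the helper name is invented for illustration; error handling omitted):

#include <pk11pub.h>

/* Build a keyed SHA-1 HMAC context per the NSS tn5 tech note */
static PK11Context *create_hmac_context(const unsigned char *key,
	unsigned int key_len)
{
	SECItem key_item = { siBuffer, (unsigned char *)key, key_len };
	SECItem no_params = { siBuffer, NULL, 0 };
	PK11SlotInfo *slot;
	PK11SymKey *sym_key;

	/* Import the shared authkey as a CKM_SHA_1_HMAC key... */
	slot = PK11_GetBestSlot(CKM_SHA_1_HMAC, NULL);
	sym_key = PK11_ImportSymKey(slot, CKM_SHA_1_HMAC, PK11_OriginUnwrap,
	    CKA_SIGN, &key_item, NULL);

	/* ...so that PK11_DigestBegin/Op/Final produce a keyed MAC, which a
	 * peer with a different authkey cannot forge, instead of a plain
	 * unkeyed digest */
	return PK11_CreateContextBySymKey(CKM_SHA_1_HMAC, CKA_SIGN,
	    sym_key, &no_params);
}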
> 
> Regards,
>   Honza
> 
> Burgess, Jon wrote:
>> After trying a few more combinations, I think the trigger for the problem
>> is encrypted traffic with mismatched authkeys. If two machines are running
>> in this state, one typically hits this assert within a few minutes.
>>
>> Restarting the processes and making/breaking the membership by swapping the
>> keys doesn't seem to be necessary.
>>
>>     Jon
>>
>>
>> -----Original Message-----
>> From: Burgess, Jon 
>> Sent: 14 December 2012 15:13
>> To: 'Jan Friesse'
>> Cc: discuss@xxxxxxxxxxxx
>> Subject: RE:  failing assert(addrlen) in totemip_equal()
>>
>>> Jon,
>>> this is a new bug. Can you please send me the config and describe exactly
>>> how you are restarting corosync? (script, kill -INT pid, ...)
>>
>> I can reproduce the crash with the script below in about 5 minutes on a
>> pair of F17 VMs running corosync-2.0.2-1.fc17.x86_64.
>>
>> The Corosync configuration has encryption enabled (aes256+sha1) and the
>> script alternates between two different keys, one which matches the peer
>> and another which doesn't:
>>
>>
>> #!/bin/bash
>>
>> # Alternate between an authkey that matches the peer and one that does
>> # not, restarting corosync at random intervals. usleep takes
>> # microseconds, so each pause is up to ~3.3 seconds.
>> while(true)
>> do
>>     echo "start good"
>>     cp /etc/corosync/authkey.good  /etc/corosync/authkey
>>     corosync
>>     usleep $(( $RANDOM * 100 ))
>>     echo "stop"
>>     killall -9 corosync
>>     usleep $(( $RANDOM * 100 ))
>>
>>     echo "start bad"
>>     cp /etc/corosync/authkey.bad  /etc/corosync/authkey
>>     corosync
>>     usleep $(( $RANDOM * 100 ))
>>     echo "stop"
>>     killall -9 corosync
>>     usleep $(( $RANDOM * 100 ))
>>
>> done
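
For reference, the crypto settings in question look roughly like this in the
totem section of corosync.conf (a sketch of just the relevant options):

totem {
        version: 2
        crypto_cipher: aes256
        crypto_hash: sha1
        # interface/bindnetaddr and nodelist settings omitted
}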
>>
>>
>> After this has been running for a few minutes, the peer crashes:
>>
>> Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408 1719314624
>> Dec 14 13:54:48 notice  [TOTEM ] totemsrp.c:1909 A processor joined or left the membership and a new membership (192.168.122.101:2824) was formed.
>> Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:148 This node is within the primary component and will provide service.
>> Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408 1719314624
>> Dec 14 13:54:48 notice  [MAIN  ] main.c:252 Completed service synchronization, ready to provide service.
>> Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
>> Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
>> Dec 14 13:54:50 notice  [TOTEM ] totemsrp.c:1610 A processor failed, forming new configuration.
>> Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
>> corosync: totemip.c:106: totemip_equal: Assertion `addrlen' failed.
>> Ringbuffer:
>>  ->OVERWRITE
>>  ->write_pt [648645]
>>  ->read_pt [0]
>>  ->size [2097152 words]
>>  =>free [5794024 bytes]
>>  =>used [2594580 bytes]
>> Aborted (core dumped)
>>
>>
>>
>>> Also, if you have coredumps, it would be interesting to see not only
>>> srp_addr->addr (because only [0] is checked), but also
>>> memb_join->proc_list_entries,
>>
>> (gdb) p memb_join->proc_list_entries
>> $10 = 2
>>
>>
>>> instance->my_proc_list_entries (should be the same)
>>
>> (gdb) p instance->my_proc_list_entries
>> $18 = 2
>>
>>> and the full content of the proc_list array (it must have
>>> memb_join->proc_list_entries entries)
>>
>> I assume you mean the "proc_list" local variable in memb_join_process()
>>
>> (gdb) p proc_list
>> $9 = <value optimized out>
>>
>> But...
>>         proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
>>
>> The contents of proc_list should be the same as this:
>>
>> (gdb) p *(struct srp_addr *)memb_join->end_of_memb_join
>> $22 = {no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}
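
In memb_join_process() the lists are just overlays on the bytes that follow
the fixed join header, so the cast above reads the same memory the code
compares. Roughly, paraphrasing totemsrp.c:

/* the variable-length member lists sit directly after the fixed header */
proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
failed_list = proc_list + memb_join->proc_list_entries;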
>>
>>
>>> and instance->my_proc_list (should
>>> have instance->my_proc_list_entries entries).
>>
>> (gdb) p instance->my_proc_list_entries
>> $11 = 2
>>
>> (gdb) p instance->my_proc_list
>> $12 = {{no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}},
>>   {no_addrs = 64 '@', addr = {{nodeid = 2171259156, family = 34025, addr = "\371[E\341t\277\006\360\031\350\256 \022\356", <incomplete sequence \343>}, {nodeid = 809278438, family = 58892, addr = "\004q3\267\030bC\243\255hm)\377\036\316\060"}}},
>>   {no_addrs = 0 '\000', addr = {{nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}} <repeats 382 times>}
>>
>>
>>
>>> The first entries in proc_list and instance->my_proc_list look the same.
>>
>> Something I forgot to mention last time: the assert triggers when it tries
>> to compare the two zeroed addresses at position [1] in each list:
>>
>> (gdb) up
>> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at
>> totemsrp.c:1291
>> 1291                            if (srp_addr_equal (&set1[j], &set2[i])) {
>> (gdb) p i
>> $6 = 1
>> (gdb) p j
>> $7 = 1
>>
>>
>>>
>>> Regards,
>>>   Honza
>>
>>
>>
>> Burgess, Jon wrote:
>>> I have been doing some tests which involve breaking the connection
>>> between two nodes by restarting Corosync, and occasionally I see the code
>>> failing the assert(addrlen) in totemip_equal(). I have hit this a couple
>>> of times now, but I'm not sure exactly how reproducible it is. This is
>>> with Corosync-2.0.1.
>>>
>>> (gdb) bt
>>> #0  0x00007f3681ad9da5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>>> #1  0x00007f3681adb2c3 in abort () at abort.c:88
>>> #2  0x00007f3681ad2d99 in __assert_fail (assertion=0x7f36838c7278 "addrlen", file=0x7f36838c726e "totemip.c", line=106, function=0x7f36838c7290 "totemip_equal") at assert.c:78
>>> #3  0x00007f36838b5229 in totemip_equal (addr1=<value optimized out>, addr2=<value optimized out>) at totemip.c:106
>>> #4  0x00007f36838c2e35 in srp_addr_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1114
>>> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
>>> #6  memb_join_process (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:4047
>>> #7  0x00007f36838c35bc in message_handler_memb_join (instance=0x7f3683ea8010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:4304
>>> #8  0x00007f36838b9748 in rrp_deliver_fn (context=0x7028c0, msg=0x743e78, msg_len=159) at totemrrp.c:1792
>>> #9  0x00007f36838b6c61 in net_deliver_fn (fd=<value optimized out>, revents=<value optimized out>, data=<value optimized out>) at totemudp.c:465
>>> #10 0x00007f3683457c1f in ?? ()
>>>
>>> Dumping out the lists being compared shows that they both have two
>>> entries. The first entry in both cases is the local node. The second
>>> entry is zeroed.
>>>
>>> (gdb) p instance->my_proc_list->addr[0]
>>> $16 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
>>> (gdb) p instance->my_proc_list->addr[1]
>>> $17 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
>>> (gdb) p instance->my_proc_list_entries
>>> $18 = 2
>>>
>>> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[0]
>>> $23 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
>>> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[1]
>>> $24 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
>>> (gdb) p memb_join->proc_list_entries
>>> $10 = 2
>>>
>>> The log messages show the local node forming and breaking associations
>>> with the peer node, but nothing which looks obviously wrong:
>>>
>>> [QUORUM] This node is within the non-primary component and will NOT provide any services.
>>> [QUORUM] Members[1]: 1
>>> [QUORUM] Members[1]: 1
>>> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:316) was formed.
>>> [MAIN  ] Completed service synchronization, ready to provide service.
>>> [QUORUM] Members[2]: 1 3
>>> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:324) was formed.
>>> [QUORUM] This node is within the primary component and will provide service.
>>> [QUORUM] Members[2]: 1 3
>>> [MAIN  ] Completed service synchronization, ready to provide service.
>>> [QUORUM] This node is within the non-primary component and will NOT provide any services.
>>> [QUORUM] Members[1]: 1
>>> [QUORUM] Members[1]: 1
>>> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:328) was formed.
>>> [MAIN  ] Completed service synchronization, ready to provide service.
>>>
>>> Does this seem familiar to anyone? Might it be fixed already in a newer
>>> release?
>>>
>>> 	Jon
>>>
>>>
>>>
>>
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

