Re: failing assert(addrlen) in totemip_equal()

On 1/2/2013 11:22 AM, Jan Friesse wrote:

> I believe the main problem is that we are using HMAC, BUT without ever
> initializing the HMAC key (so you will not get a HASH fault, but rather a
> DECRYPT fault, and it CAN happen that the data are decrypted (and
> interpreted) into total garbage).
> 
> So the main question is (fabbione): shouldn't we use HMAC with an
> initialized key (in init_nss_hash, hash_param.data should point to the key
> and hash_param.len to key_len), as stated in the HMAC section of
> http://www.mozilla.org/projects/security/pki/nss/tech-notes/tn5.html?
> I believe this would solve the problem Jon is hitting.

Well spotted.

The fix is "easy", as we just need to change the pointers in
init_nss_hash, but:

1) it would break on-wire compatibility, because all of a sudden the HMAC
key will change and, with it, the generated signature on the packets
(a rough sketch of the keyed setup is at the end of this mail).

We now have the capabilities to deal with that, but you need to tell me
how you prefer to handle it.

2) there is a more stupid error in totemudp*.c that we need to address too.

>         res = crypto_authenticate_and_decrypt (instance->crypto_inst, iovec->iov_base, &bytes_received);
>         if (res == -1) {
>                 log_printf (instance->totemudp_log_level_security, "Received message has invalid digest... ignoring.");
>                 log_printf (instance->totemudp_log_level_security,
>                         "Invalid packet data");
>                 iovec->iov_len = FRAME_SIZE_MAX;
>                 return 0;
>         }

a) We return 0 even if we fail to process the packet in crypto.

b) I can't find the code where we check the net_deliver_fn return
codes... do we do that at all, or do we rely on a well-formed packet being
in the buffer? If it's the latter, we might have to nuke the buffer when
crypto fails, or pass the error via iov_len.
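
To make (a) and (b) concrete, I mean something along these lines (only a
sketch, not a final patch; whether returning -1, zeroing the buffer, or
signalling via iov_len is right depends on what the callers of
net_deliver_fn actually do):

        res = crypto_authenticate_and_decrypt (instance->crypto_inst,
                iovec->iov_base, &bytes_received);
        if (res == -1) {
                log_printf (instance->totemudp_log_level_security,
                        "Received message has invalid digest... ignoring.");
                log_printf (instance->totemudp_log_level_security,
                        "Invalid packet data");
                /*
                 * Sketch: don't leave a stale/garbage frame around and don't
                 * pretend the datagram was processed successfully.
                 */
                memset (iovec->iov_base, 0, FRAME_SIZE_MAX);
                iovec->iov_len = FRAME_SIZE_MAX;
                return (-1);
        }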

Making a patch is easy; just let me know how you prefer to handle it.
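
Here is the rough sketch of the keyed init_nss_hash setup mentioned under
point 1. It follows the PK11 HMAC flow from the NSS tech note Honza linked;
the mechanism and the instance field names are assumptions for illustration,
not the actual totemcrypto.c code:

        #include <pk11pub.h>

        /*
         * Sketch only: import the private key as an NSS symmetric key and
         * build the digest context from it, so that the later
         * PK11_DigestBegin()/PK11_DigestOp()/PK11_DigestFinal() calls
         * compute a keyed HMAC instead of an unkeyed hash.
         */
        SECItem hash_param = {
                .type = siBuffer,
                .data = instance->private_key,          /* assumed field name */
                .len  = instance->private_key_len       /* assumed field name */
        };

        PK11SlotInfo *slot = PK11_GetBestSlot (CKM_SHA_1_HMAC, NULL);
        PK11SymKey *sym_key = PK11_ImportSymKey (slot, CKM_SHA_1_HMAC,
                PK11_OriginUnwrap, CKA_SIGN, &hash_param, NULL);

        SECItem no_params = { .type = siBuffer, .data = NULL, .len = 0 };
        PK11Context *hash_ctx = PK11_CreateContextBySymKey (CKM_SHA_1_HMAC,
                CKA_SIGN, sym_key, &no_params);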

Fabio


> 
> Regards,
>   Honza
> 
> Burgess, Jon wrote:
>> After trying a few more combinations, I think the trigger for the problem is
>> encrypted traffic with mismatched authkeys. If two machines are running
>> in this state, one typically hits this assert within a few minutes.
>>
>> Restarting the processes and making/breaking the membership by swapping the
>> keys doesn't seem to be necessary.
>>
>>     Jon
>>
>>
>> -----Original Message-----
>> From: Burgess, Jon 
>> Sent: 14 December 2012 15:13
>> To: 'Jan Friesse'
>> Cc: discuss@xxxxxxxxxxxx
>> Subject: RE:  failing assert(addrlen) in totemip_equal()
>>
>>> Jon,
>>> this is a new bug. Can you please send me your config and describe exactly
>>> how you are restarting corosync? (script, kill -INT pid, ...)
>>
>> I can reproduce the crash with the script below in about 5 minutes on a
>> pair of F17 VMs running corosync-2.0.2-1.fc17.x86_64.
>>
>> The Corosync configuration has encryption enabled (aes256+sha1), and
>> the script alternates between two different keys, one which matches the
>> peer and another which doesn't:
>>
>>
>> #!/bin/bash
>>
>> while(true)
>> do
>>     echo "start good"
>>     cp /etc/corosync/authkey.good  /etc/corosync/authkey
>>     corosync
>>     usleep $(( $RANDOM * 100 ))
>>     echo "stop"
>>     killall -9 corosync
>>     usleep $(( $RANDOM * 100 ))
>>
>>     echo "start bad"
>>     cp /etc/corosync/authkey.bad  /etc/corosync/authkey
>>     corosync
>>     usleep $(( $RANDOM * 100 ))
>>     echo "stop"
>>     killall -9 corosync
>>     usleep $(( $RANDOM * 100 ))
>>
>> done
>>
>>
>> After this has been running for a few minutes, the peer crashes:
>>
>> Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408
>> 1719314624
>> Dec 14 13:54:48 notice  [TOTEM ] totemsrp.c:1909 A processor joined or left
>> the membership and a new membership (192.168.122.101:2824) was formed.
>> Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:148 This node is within the
>> primary component and will provide service.
>> Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408
>> 1719314624
>> Dec 14 13:54:48 notice  [MAIN  ] main.c:252 Completed service
>> synchronization, ready to provide service.
>> Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal
>> (decrypt) failed (err -8023)
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid
>> digest... ignoring.
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
>> Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal
>> (decrypt) failed (err -8023)
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid
>> digest... ignoring.
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
>> Dec 14 13:54:50 notice  [TOTEM ] totemsrp.c:1610 A processor failed, forming
>> new configuration.
>> Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal
>> (decrypt) failed (err -8023)
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid
>> digest... ignoring.
>> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
>> corosync: totemip.c:106: totemip_equal: Assertion `addrlen' failed.
>> Ringbuffer:
>>  ->OVERWRITE
>>  ->write_pt [648645]
>>  ->read_pt [0]
>>  ->size [2097152 words]
>>  =>free [5794024 bytes]
>>  =>used [2594580 bytes]
>> Aborted (core dumped)
>>
>>
>>
>>> Also if you have coredumps, it would be interesting to see not only
>>> srp_addr->addr (because only [0] is checked), but also
>>> memb_join->proc_list_entries, 
>>
>> (gdb) p memb_join->proc_list_entries
>> $10 = 2
>>
>>
>>> instance->my_proc_list_entries (should be
>>> same) 
>>
>> (gdb) p instance->my_proc_list_entries
>> $18 = 2
>>
>>> and full content of proc_list array (must have
>>> memb_join->proc_list_entries entries) 
>>
>> I assume you mean the "proc_list" local variable in memb_join_process()
>>
>> (gdb) p proc_list
>> $9 = <value optimized out>
>>
>> But...
>>         proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
>>
>> The contents of proc_list should be the same as this:
>>
>> (gdb) p *(struct srp_addr *)memb_join->end_of_memb_join
>> $22 = {no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr =
>> "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0,
>>       addr = '\000' <repeats 15 times>}}}
>>
>>
>>> and instance->my_proc_list (should
>>> have instance->my_proc_list_entries entries).
>>
>> (gdb) p instance->my_proc_list_entries
>> $11 = 2
>>
>> (gdb) p instance->my_proc_list
>> $12 = {{no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr =
>> "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0,
>>         addr = '\000' <repeats 15 times>}}}, {no_addrs = 64 '@', addr =
>> {{nodeid = 2171259156, family = 34025,
>>         addr = "\371[E\341t\277\006\360\031\350\256 \022\356", <incomplete
>> sequence \343>}, {nodeid = 809278438, family = 58892,
>>         addr = "\004q3\267\030bC\243\255hm)\377\036\316\060"}}}, {no_addrs =
>> 0 '\000', addr = {{nodeid = 0, family = 0,
>>         addr = '\000' <repeats 15 times>}, {nodeid = 0, family = 0, addr =
>> '\000' <repeats 15 times>}}} <repeats 382 times>}
>>
>>
>>
>>> The first entry in proc_list and instance->my_proc_list looks the same.
>>
>> Something I forgot to mention last time was that the assert triggers when it
>> tries to compare the two 
>> zeroed addresses at the [1] position in the list:
>>
>> (gdb) up
>> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at
>> totemsrp.c:1291
>> 1291                            if (srp_addr_equal (&set1[j], &set2[i])) {
>>  (gdb) p i
>> $6 = 1
>> (gdb) p j
>> $7 = 1
>>
>>
>>>
>>> Regards,
>>>   Honza
>>
>>
>>
>> Burgess, Jon wrote:
>>> I have been doing some tests which involve breaking the connection between
>>> two nodes by restarting Corosync, and occasionally I see the code failing the
>>> assert(addrlen) in totemip_equal(). I have hit this a couple of times now,
>>> but I'm not sure exactly how reproducible it is. This is with
>>> Corosync 2.0.1.
>>>
>>> (gdb) bt
>>> #0  0x00007f3681ad9da5 in raise (sig=<value optimized out>) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>>> #1  0x00007f3681adb2c3 in abort () at abort.c:88
>>> #2  0x00007f3681ad2d99 in __assert_fail (assertion=0x7f36838c7278
>> "addrlen", file=0x7f36838c726e "totemip.c", line=106,
>>>     function=0x7f36838c7290 "totemip_equal") at assert.c:78
>>> #3  0x00007f36838b5229 in totemip_equal (addr1=<value optimized out>,
>> addr2=<value optimized out>) at totemip.c:106
>>> #4  0x00007f36838c2e35 in srp_addr_equal (instance=0x7f3683ea8010,
>> memb_join=0x743e78) at totemsrp.c:1114
>>> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at
>> totemsrp.c:1291
>>> #6  memb_join_process (instance=0x7f3683ea8010, memb_join=0x743e78) at
>> totemsrp.c:4047
>>> #7  0x00007f36838c35bc in message_handler_memb_join
>> (instance=0x7f3683ea8010, msg=<value optimized out>, msg_len=<value
>> optimized out>,
>>>     endian_conversion_needed=<value optimized out>) at totemsrp.c:4304
>>> #8  0x00007f36838b9748 in rrp_deliver_fn (context=0x7028c0, msg=0x743e78,
>> msg_len=159) at totemrrp.c:1792
>>> #9  0x00007f36838b6c61 in net_deliver_fn (fd=<value optimized out>,
>> revents=<value optimized out>, data=<value optimized out>) at totemudp.c:465
>>> #10 0x00007f3683457c1f in ?? ()
>>>
>>> Dumping out the lists being compared shows that they both have two
>> entries. The first entry in both cases is the local node. The second entry
>> is zeroed.
>>>
>>> (gdb) p instance->my_proc_list->addr[0]
>>> $16 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats
>> 11 times>}
>>> (gdb) p instance->my_proc_list->addr[1]
>>> $17 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
>>> (gdb) p instance->my_proc_list_entries
>>> $18 = 2
>>>
>>> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[0]
>>> $23 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats
>> 11 times>}
>>> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[1]
>>> $24 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
>>> (gdb) p memb_join->proc_list_entries
>>> $10 = 2
>>>
>>> The log messages show the local node forming and breaking associations with
>>> the peer node, but nothing looks obviously wrong:
>>>
>>> [QUORUM] This node is within the non-primary component and will NOT
>> provide any services.
>>> [QUORUM] Members[1]: 1
>>> [QUORUM] Members[1]: 1
>>> [TOTEM ] A processor joined or left the membership and a new membership
>> (169.254.0.1:316) was formed.
>>> [MAIN  ] Completed service synchronization, ready to provide service.
>>> [QUORUM] Members[2]: 1 3
>>> [TOTEM ] A processor joined or left the membership and a new membership
>> (169.254.0.1:324) was formed.
>>> [QUORUM] This node is within the primary component and will provide
>> service.
>>> [QUORUM] Members[2]: 1 3
>>> [MAIN  ] Completed service synchronization, ready to provide service.
>>> [QUORUM] This node is within the non-primary component and will NOT
>> provide any services.
>>> [QUORUM] Members[1]: 1
>>> [QUORUM] Members[1]: 1
>>> [TOTEM ] A processor joined or left the membership and a new membership
>> (169.254.0.1:328) was formed.
>>> [MAIN  ] Completed service synchronization, ready to provide service.
>>>
>>> Does this seem familiar to anyone? Might it be fixed already in a newer
>> release?
>>>
>>> 	Jon
>>>
>>>
>>>

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


