Re: failing assert(addrlen) in totemip_equal()

Jon,
thanks for digging into the problem. Generally, it's a bad idea to use
different authkeys (that's why we log so many decrypt errors). Corosync
itself does NOT handle Byzantine faults (which is what is happening here),
and that is why the problem appears. On the other hand, we do use a digest
to prevent some kinds of Byzantine failures (and the one you are seeing
should be that kind of failure).

I believe the main problem is that we are using HMAC, BUT without ever
initializing the HMAC key (so you will not get a HASH failure, but rather
a DECRYPT failure, and it CAN happen that the data are decrypted (and
interpreted) into total garbage).

So the main question is (fabbione): shouldn't we use HMAC with an
initialized key (in init_nss_hash, hash_param.data would point to the key,
and hash_param.len to key_len), as described in the HMAC section of
http://www.mozilla.org/projects/security/pki/nss/tech-notes/tn5.html?
I believe this would solve the problem Jon is hitting.
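
To illustrate, here is a minimal sketch (only of the NSS calls described in
the tn5.html HMAC section, not the actual totemcrypto.c code; "key" and
"key_len" are stand-ins for the authkey material, and NSS itself is assumed
to be initialized elsewhere):

#include <nss.h>
#include <pk11pub.h>

/* Sketch of keyed HMAC-SHA1 setup with NSS, following tn5.html.
 * Error checking is omitted for brevity. */
static PK11Context *create_hmac_ctx(unsigned char *key, unsigned int key_len)
{
	SECItem hash_param = { siBuffer, key, key_len };
	SECItem no_params = { siBuffer, NULL, 0 };
	PK11SlotInfo *slot = PK11_GetBestSlot(CKM_SHA_1_HMAC, NULL);

	/* Importing the authkey is the step that turns the digest into a
	 * keyed HMAC; without it, traffic from a node with a different
	 * authkey can still carry a "valid" digest and reach the decrypt
	 * path, where it decrypts into garbage. */
	PK11SymKey *sym_key = PK11_ImportSymKey(slot, CKM_SHA_1_HMAC,
	    PK11_OriginUnwrap, CKA_SIGN, &hash_param, NULL);
	PK11Context *ctx = PK11_CreateContextBySymKey(CKM_SHA_1_HMAC,
	    CKA_SIGN, sym_key, &no_params);

	PK11_DigestBegin(ctx);
	return ctx;
}

The digest itself would still be computed with PK11_DigestOp() and
PK11_DigestFinal() as today; only the context setup changes.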

Regards,
  Honza

Burgess, Jon wrote:
> After trying a few more combinations, I think the trigger for the problem
> is encrypted traffic with mismatched authkeys. If two machines are running
> in this state, then one typically hits this assert within a few minutes.
> 
> Restarting the processes and making/breaking the membership by swapping the
> keys doesn't seem to be necessary.
> 
>     Jon
> 
> 
> -----Original Message-----
> From: Burgess, Jon 
> Sent: 14 December 2012 15:13
> To: 'Jan Friesse'
> Cc: discuss@xxxxxxxxxxxx
> Subject: RE:  failing assert(addrlen) in totemip_equal()
> 
>> Jon,
>> this is a new bug. Can you please send me your config and describe
>> exactly how you are restarting corosync? (script, kill -INT pid, ...)
> 
> I can reproduce the crash with the script below in about 5 minutes on a
> pair of F17 VMs running corosync-2.0.2-1.fc17.x86_64.
> 
> The Corosync configuration has encryption enabled (aes256+sha1) and
> the script alternates between two different keys, one which matches the
> peer and another which doesn't:
> 
> 
> #!/bin/bash
> 
> # Alternate corosync between a key that matches the peer ("good") and
> # one that doesn't ("bad"). usleep takes microseconds, so
> # $(( $RANDOM * 100 )) sleeps for a random 0-3.3 seconds.
> while true
> do
>     echo "start good"
>     cp /etc/corosync/authkey.good /etc/corosync/authkey
>     corosync
>     usleep $(( $RANDOM * 100 ))
>     echo "stop"
>     killall -9 corosync
>     usleep $(( $RANDOM * 100 ))
> 
>     echo "start bad"
>     cp /etc/corosync/authkey.bad /etc/corosync/authkey
>     corosync
>     usleep $(( $RANDOM * 100 ))
>     echo "stop"
>     killall -9 corosync
>     usleep $(( $RANDOM * 100 ))
> 
> done
> 
> 
> After this has been running for a few minutes, the peer crashes:
> 
> Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408
> 1719314624
> Dec 14 13:54:48 notice  [TOTEM ] totemsrp.c:1909 A processor joined or left
> the membership and a new membership (192.168.122.101:2824) was formed.
> Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:148 This node is within the
> primary component and will provide service.
> Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408
> 1719314624
> Dec 14 13:54:48 notice  [MAIN  ] main.c:252 Completed service
> synchronization, ready to provide service.
> Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal
> (decrypt) failed (err -8023)
> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid
> digest... ignoring.
> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
> Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal
> (decrypt) failed (err -8023)
> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid
> digest... ignoring.
> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
> Dec 14 13:54:50 notice  [TOTEM ] totemsrp.c:1610 A processor failed, forming
> new configuration.
> Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal
> (decrypt) failed (err -8023)
> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid
> digest... ignoring.
> Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
> corosync: totemip.c:106: totemip_equal: Assertion `addrlen' failed.
> Ringbuffer:
>  ->OVERWRITE
>  ->write_pt [648645]
>  ->read_pt [0]
>  ->size [2097152 words]
>  =>free [5794024 bytes]
>  =>used [2594580 bytes]
> Aborted (core dumped)
> 
> 
> 
>> Also, if you have coredumps, it would be interesting to see not only
>> srp_addr->addr (because only [0] is checked), but also
>> memb_join->proc_list_entries,
> 
> (gdb) p memb_join->proc_list_entries
> $10 = 2
> 
> 
>> instance->my_proc_list_entries (they should be the same),
> 
> (gdb) p instance->my_proc_list_entries
> $18 = 2
> 
>> and the full content of the proc_list array (it must have
>> memb_join->proc_list_entries entries),
> 
> I assume you mean the "proc_list" local variable in memb_join_process().
> 
> (gdb) p proc_list
> $9 = <value optimized out>
> 
> But...
>         proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
> 
> The contents of proc_list should be the same as this:
> 
> (gdb) p *(struct srp_addr *)memb_join->end_of_memb_join
> $22 = {no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr =
> "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0,
>       addr = '\000' <repeats 15 times>}}}
> 
> 
>> and instance->my_proc_list (should
>> have instance->my_proc_list_entries entries).
> 
> (gdb) p instance->my_proc_list_entries
> $11 = 2
> 
> (gdb) p instance->my_proc_list
> $12 = {{no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr =
> "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0,
>         addr = '\000' <repeats 15 times>}}}, {no_addrs = 64 '@', addr =
> {{nodeid = 2171259156, family = 34025,
>         addr = "\371[E\341t\277\006\360\031\350\256 \022\356", <incomplete
> sequence \343>}, {nodeid = 809278438, family = 58892,
>         addr = "\004q3\267\030bC\243\255hm)\377\036\316\060"}}}, {no_addrs =
> 0 '\000', addr = {{nodeid = 0, family = 0,
>         addr = '\000' <repeats 15 times>}, {nodeid = 0, family = 0, addr =
> '\000' <repeats 15 times>}}} <repeats 382 times>}
> 
> 
> 
>> The first entries in proc_list and instance->my_proc_list look the same.
> 
> Something I forgot to mention last time was that the assert triggers when
> it tries to compare the two zeroed addresses at the [1] position in the
> list:
> 
> (gdb) up
> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
> 1291                            if (srp_addr_equal (&set1[j], &set2[i])) {
> (gdb) p i
> $6 = 1
> (gdb) p j
> $7 = 1
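> 
> To make the failure mode concrete, here is a paraphrase of why a zeroed
> entry trips the assert (reconstructed from the corosync 2.x totemip.c,
> not copied verbatim; the struct below is a stand-in mirroring the fields
> visible in the gdb dumps): addrlen is only set for a recognised address
> family, so family == 0 leaves it at zero:
> 
> #include <assert.h>
> #include <string.h>
> #include <sys/socket.h>
> #include <netinet/in.h>
> 
> /* Stand-in for the definition in corosync's totemip.h, mirroring the
>  * fields (nodeid, family, addr[16]) shown in the gdb dumps above. */
> struct totem_ip_address {
>         unsigned int   nodeid;
>         unsigned short family;
>         unsigned char  addr[16];
> };
> 
> /* Paraphrase of totemip_equal(): two zeroed entries have equal
>  * families, family 0 matches neither AF_INET nor AF_INET6, addrlen
>  * stays 0, and assert(addrlen) aborts. */
> int totemip_equal(const struct totem_ip_address *addr1,
>                   const struct totem_ip_address *addr2)
> {
>         int addrlen = 0;
> 
>         if (addr1->family != addr2->family) {
>                 return (0);
>         }
>         if (addr1->family == AF_INET) {
>                 addrlen = sizeof(struct in_addr);
>         }
>         if (addr1->family == AF_INET6) {
>                 addrlen = sizeof(struct in6_addr);
>         }
>         assert(addrlen);
> 
>         return (memcmp(addr1->addr, addr2->addr, addrlen) == 0);
> }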
> 
> 
>>
>> Regards,
>>   Honza
> 
> 
> 
> Burgess, Jon wrote:
>> I have been doing some tests which involve breaking the connection between
>> two nodes by restarting Corosync, and occasionally I see the code failing
>> the assert(addrlen) in totemip_equal(). I have hit this a couple of times
>> now, but I'm not sure exactly how reproducible it is. This is with
>> Corosync-2.0.1.
>>
>> (gdb) bt
>> #0  0x00007f3681ad9da5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>> #1  0x00007f3681adb2c3 in abort () at abort.c:88
>> #2  0x00007f3681ad2d99 in __assert_fail (assertion=0x7f36838c7278 "addrlen", file=0x7f36838c726e "totemip.c", line=106,
>>     function=0x7f36838c7290 "totemip_equal") at assert.c:78
>> #3  0x00007f36838b5229 in totemip_equal (addr1=<value optimized out>, addr2=<value optimized out>) at totemip.c:106
>> #4  0x00007f36838c2e35 in srp_addr_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1114
>> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
>> #6  memb_join_process (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:4047
>> #7  0x00007f36838c35bc in message_handler_memb_join (instance=0x7f3683ea8010, msg=<value optimized out>, msg_len=<value optimized out>,
>>     endian_conversion_needed=<value optimized out>) at totemsrp.c:4304
>> #8  0x00007f36838b9748 in rrp_deliver_fn (context=0x7028c0, msg=0x743e78, msg_len=159) at totemrrp.c:1792
>> #9  0x00007f36838b6c61 in net_deliver_fn (fd=<value optimized out>, revents=<value optimized out>, data=<value optimized out>) at totemudp.c:465
>> #10 0x00007f3683457c1f in ?? ()
>>
>> Dumping out the lists being compared shows that they both have two
>> entries. The first entry in both cases is the local node. The second
>> entry is zeroed.
>>
>> (gdb) p instance->my_proc_list->addr[0]
>> $16 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
>> (gdb) p instance->my_proc_list->addr[1]
>> $17 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
>> (gdb) p instance->my_proc_list_entries
>> $18 = 2
>>
>> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[0]
>> $23 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
>> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[1]
>> $24 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
>> (gdb) p memb_join->proc_list_entries
>> $10 = 2
>>
>> The log messages show the local node forming and breaking associations
>> with the peer node, but nothing looks obviously wrong:
>>
>> [QUORUM] This node is within the non-primary component and will NOT provide any services.
>> [QUORUM] Members[1]: 1
>> [QUORUM] Members[1]: 1
>> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:316) was formed.
>> [MAIN  ] Completed service synchronization, ready to provide service.
>> [QUORUM] Members[2]: 1 3
>> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:324) was formed.
>> [QUORUM] This node is within the primary component and will provide service.
>> [QUORUM] Members[2]: 1 3
>> [MAIN  ] Completed service synchronization, ready to provide service.
>> [QUORUM] This node is within the non-primary component and will NOT provide any services.
>> [QUORUM] Members[1]: 1
>> [QUORUM] Members[1]: 1
>> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:328) was formed.
>> [MAIN  ] Completed service synchronization, ready to provide service.
>>
>> Does this seem familiar to anyone? Might it be fixed already in a newer
>> release?
>>
>> 	Jon
>>
>>
>>
>> _______________________________________________
>> discuss mailing list
>> discuss@xxxxxxxxxxxx
>> http://lists.corosync.org/mailman/listinfo/discuss
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

