Hi Honza,

I'll take a look when I am back from holidays next week.

Fabio

On 01/02/2013 11:22 AM, Jan Friesse wrote:
> Jon,
> thanks for digging into the problem. Generally, it's a bad idea to use
> different authkeys (that's why we log so many decrypt errors). Corosync
> itself does NOT handle byzantine faults (which is what is happening here),
> and this is why the problem appears. On the other hand, we are using a
> digest to prevent some kinds of byzantine failures (and the one you see
> should be that kind of failure).
>
> I believe the main problem is that we are using HMAC, BUT without ever
> initializing the HMAC key (so you will not get a HASH fault, but rather a
> DECRYPT fault, and the data CAN end up being decrypted (and interpreted)
> into total garbage).
>
> So the main question is (fabbione): shouldn't we use HMAC with an
> initialized key (in init_nss_hash, hash_param.data points to key, and
> hash_param.len to key_len), as stated in
> http://www.mozilla.org/projects/security/pki/nss/tech-notes/tn5.html in
> the HMAC section? I believe this would solve the problem Jon is hitting.
>
> Regards,
> Honza
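For reference, a minimal sketch of what a keyed HMAC setup along the lines Honza describes might look like with the NSS PK11 API. This follows the pattern from the NSS tech note (tn5) HMAC section; the function name and error handling are illustrative, not the actual totemcrypto.c code:

    #include <pk11pub.h>

    /*
     * Illustrative sketch only: import the cluster authkey as an
     * HMAC-SHA1 signing key and build the digest context from it,
     * instead of hashing with no key at all.
     */
    static PK11Context *init_keyed_hmac(const unsigned char *key,
                                        unsigned int key_len)
    {
        SECItem key_item = { siBuffer, (unsigned char *)key, key_len };
        SECItem no_params = { siBuffer, NULL, 0 };
        PK11SlotInfo *slot;
        PK11SymKey *sym_key;
        PK11Context *ctx;

        slot = PK11_GetBestSlot(CKM_SHA_1_HMAC, NULL);
        if (slot == NULL) {
            return NULL;
        }

        /* The key import is the missing piece: without it the digest
         * is effectively unkeyed. */
        sym_key = PK11_ImportSymKey(slot, CKM_SHA_1_HMAC, PK11_OriginUnwrap,
                                    CKA_SIGN, &key_item, NULL);
        PK11_FreeSlot(slot);
        if (sym_key == NULL) {
            return NULL;
        }

        ctx = PK11_CreateContextBySymKey(CKM_SHA_1_HMAC, CKA_SIGN,
                                         sym_key, &no_params);
        PK11_FreeSymKey(sym_key);
        return ctx;
    }

With a context built this way (and the usual PK11_DigestBegin/PK11_DigestOp/PK11_DigestFinal calls on it), a packet protected with a different authkey should simply fail the digest comparison and be dropped, rather than being decrypted into garbage that reaches the membership code.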
> Burgess, Jon wrote:
>> After trying a few more combinations I think the trigger for the problem is
>> the encrypted traffic with mismatched authkeys. If two machines are running
>> in this state then one typically hits this assert in a few minutes.
>>
>> Restarting the processes and making/breaking the membership by swapping the
>> keys doesn't seem to be necessary.
>>
>> Jon
>>
>>
>> -----Original Message-----
>> From: Burgess, Jon
>> Sent: 14 December 2012 15:13
>> To: 'Jan Friesse'
>> Cc: discuss@xxxxxxxxxxxx
>> Subject: RE: failing assert(addrlen) in totemip_equal()
>>
>>> Jon,
>>> this is a new bug. Can you please send me the config and describe how
>>> exactly you are restarting corosync? (script, kill -INT pid, ...)
>>
>> I can reproduce the crash with the script below in about 5 minutes on a
>> pair of F17 VMs with corosync-2.0.2-1.fc17.x86_64.
>>
>> The Corosync configuration has encryption enabled (aes256+sha1) and
>> the script alternates between two different keys, one which matches the
>> peer and another which doesn't:
>>
>> #!/bin/bash
>>
>> while(true)
>> do
>>     echo "start good"
>>     cp /etc/corosync/authkey.good /etc/corosync/authkey
>>     corosync
>>     usleep $(( $RANDOM * 100 ))
>>     echo "stop"
>>     killall -9 corosync
>>     usleep $(( $RANDOM * 100 ))
>>
>>     echo "start bad"
>>     cp /etc/corosync/authkey.bad /etc/corosync/authkey
>>     corosync
>>     usleep $(( $RANDOM * 100 ))
>>     echo "stop"
>>     killall -9 corosync
>>     usleep $(( $RANDOM * 100 ))
>> done
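For context, the crypto-related part of a corosync 2.x totem section for the setup described above would look roughly like this (the interface values are illustrative, not Jon's actual configuration):

    totem {
        version: 2
        # aes256+sha1, as used in the reproduction above
        crypto_cipher: aes256
        crypto_hash: sha1

        interface {
            ringnumber: 0
            bindnetaddr: 192.168.122.0   # illustrative network
            mcastport: 5405
        }
    }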
>> After this has been running for a few minutes the peer crashes:
>>
>> Dec 14 13:54:48 notice [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408 1719314624
>> Dec 14 13:54:48 notice [TOTEM ] totemsrp.c:1909 A processor joined or left the membership and a new membership (192.168.122.101:2824) was formed.
>> Dec 14 13:54:48 notice [QUORUM] vsf_quorum.c:148 This node is within the primary component and will provide service.
>> Dec 14 13:54:48 notice [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408 1719314624
>> Dec 14 13:54:48 notice [MAIN ] main.c:252 Completed service synchronization, ready to provide service.
>> Dec 14 13:54:50 alert [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
>> Dec 14 13:54:50 alert [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
>> Dec 14 13:54:50 alert [TOTEM ] totemudp.c:446 Invalid packet data
>> Dec 14 13:54:50 alert [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
>> Dec 14 13:54:50 alert [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
>> Dec 14 13:54:50 alert [TOTEM ] totemudp.c:446 Invalid packet data
>> Dec 14 13:54:50 notice [TOTEM ] totemsrp.c:1610 A processor failed, forming new configuration.
>> Dec 14 13:54:50 alert [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
>> Dec 14 13:54:50 alert [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
>> Dec 14 13:54:50 alert [TOTEM ] totemudp.c:446 Invalid packet data
>> corosync: totemip.c:106: totemip_equal: Assertion `addrlen' failed.
>> Ringbuffer:
>>  ->OVERWRITE
>>  ->write_pt [648645]
>>  ->read_pt [0]
>>  ->size [2097152 words]
>>  =>free [5794024 bytes]
>>  =>used [2594580 bytes]
>> Aborted (core dumped)
>>
>>
>>> Also if you have coredumps, it would be interesting to see not only
>>> srp_addr->addr (because only [0] is checked), but also
>>> memb_join->proc_list_entries,
>>
>> (gdb) p memb_join->proc_list_entries
>> $10 = 2
>>
>>> instance->my_proc_list_entries (should be the same)
>>
>> (gdb) p instance->my_proc_list_entries
>> $18 = 2
>>
>>> and the full content of the proc_list array (it must have
>>> memb_join->proc_list_entries entries)
>>
>> I assume you mean the "proc_list" local variable in memb_join_process():
>>
>> (gdb) p proc_list
>> $9 = <value optimized out>
>>
>> But...
>>
>>     proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
>>
>> so the contents of proc_list should be the same as this:
>>
>> (gdb) p *(struct srp_addr *)memb_join->end_of_memb_join
>> $22 = {no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}
>>
>>> and instance->my_proc_list (which should
>>> have instance->my_proc_list_entries entries).
>>
>> (gdb) p instance->my_proc_list_entries
>> $11 = 2
>>
>> (gdb) p instance->my_proc_list
>> $12 = {{no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}, {no_addrs = 64 '@', addr = {{nodeid = 2171259156, family = 34025, addr = "\371[E\341t\277\006\360\031\350\256 \022\356", <incomplete sequence \343>}, {nodeid = 809278438, family = 58892, addr = "\004q3\267\030bC\243\255hm)\377\036\316\060"}}}, {no_addrs = 0 '\000', addr = {{nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}} <repeats 382 times>}
>>
>>> The first entry in proc_list and instance->my_proc_list looks the same.
>>
>> Something I forgot to mention last time is that the assert triggers when it
>> tries to compare the two zeroed addresses at the [1] position in the list:
>>
>> (gdb) up
>> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
>> 1291                    if (srp_addr_equal (&set1[j], &set2[i])) {
>> (gdb) p i
>> $6 = 1
>> (gdb) p j
>> $7 = 1
>>
>>> Regards,
>>> Honza
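To make the failure mode concrete: totemip_equal() derives the comparison length from the address family, so comparing two all-zero entries (family = 0, as in the [1] slots above) leaves the length at zero and trips the assert before any bytes are compared. A rough, self-contained sketch of that logic (a simplified stand-in, not the exact totemip.c source; the struct here is illustrative):

    #include <assert.h>
    #include <string.h>
    #include <netinet/in.h>   /* struct in_addr, struct in6_addr */
    #include <sys/socket.h>   /* AF_INET, AF_INET6 */

    /* Simplified stand-in for corosync's totem address structure. */
    struct totem_ip_address_sketch {
        unsigned int nodeid;
        unsigned short family;
        unsigned char addr[16];
    };

    static int totemip_equal_sketch(const struct totem_ip_address_sketch *a,
                                    const struct totem_ip_address_sketch *b)
    {
        int addrlen = 0;

        if (a->family == AF_INET && b->family == AF_INET) {
            addrlen = sizeof(struct in_addr);   /* 4 bytes */
        }
        if (a->family == AF_INET6 && b->family == AF_INET6) {
            addrlen = sizeof(struct in6_addr);  /* 16 bytes */
        }
        /* Two zeroed entries (family == 0) never set addrlen, so this fires. */
        assert(addrlen);

        return memcmp(a->addr, b->addr, addrlen) == 0;
    }

That is consistent with the backtrace and gdb output in Jon's original report below: memb_set_equal() gets past the valid [0] entries and then hands the zeroed [1] entries to srp_addr_equal()/totemip_equal().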
>> Burgess, Jon wrote:
>>> I have been doing some tests which involve breaking the connection between
>>> two nodes by restarting Corosync, and occasionally I see the code failing the
>>> assert(addrlen) in totemip_equal(). I have hit this a couple of times now
>>> but I'm not sure exactly how reproducible it is. This is with Corosync-2.0.1.
>>>
>>> (gdb) bt
>>> #0  0x00007f3681ad9da5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>>> #1  0x00007f3681adb2c3 in abort () at abort.c:88
>>> #2  0x00007f3681ad2d99 in __assert_fail (assertion=0x7f36838c7278 "addrlen", file=0x7f36838c726e "totemip.c", line=106, function=0x7f36838c7290 "totemip_equal") at assert.c:78
>>> #3  0x00007f36838b5229 in totemip_equal (addr1=<value optimized out>, addr2=<value optimized out>) at totemip.c:106
>>> #4  0x00007f36838c2e35 in srp_addr_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1114
>>> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
>>> #6  memb_join_process (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:4047
>>> #7  0x00007f36838c35bc in message_handler_memb_join (instance=0x7f3683ea8010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:4304
>>> #8  0x00007f36838b9748 in rrp_deliver_fn (context=0x7028c0, msg=0x743e78, msg_len=159) at totemrrp.c:1792
>>> #9  0x00007f36838b6c61 in net_deliver_fn (fd=<value optimized out>, revents=<value optimized out>, data=<value optimized out>) at totemudp.c:465
>>> #10 0x00007f3683457c1f in ?? ()
>>>
>>> Dumping out the lists being compared shows that they both have two entries.
>>> The first entry in both cases is the local node. The second entry is zeroed.
>>>
>>> (gdb) p instance->my_proc_list->addr[0]
>>> $16 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
>>> (gdb) p instance->my_proc_list->addr[1]
>>> $17 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
>>> (gdb) p instance->my_proc_list_entries
>>> $18 = 2
>>>
>>> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[0]
>>> $23 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
>>> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[1]
>>> $24 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
>>> (gdb) p memb_join->proc_list_entries
>>> $10 = 2
>>>
>>> The log messages show the local node forming and breaking associations with
>>> the peer node, but nothing which looks obviously wrong:
>>>
>>> [QUORUM] This node is within the non-primary component and will NOT provide any services.
>>> [QUORUM] Members[1]: 1
>>> [QUORUM] Members[1]: 1
>>> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:316) was formed.
>>> [MAIN  ] Completed service synchronization, ready to provide service.
>>> [QUORUM] Members[2]: 1 3
>>> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:324) was formed.
>>> [QUORUM] This node is within the primary component and will provide service.
>>> [QUORUM] Members[2]: 1 3
>>> [MAIN  ] Completed service synchronization, ready to provide service.
>>> [QUORUM] This node is within the non-primary component and will NOT provide any services.
>>> [QUORUM] Members[1]: 1
>>> [QUORUM] Members[1]: 1
>>> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:328) was formed.
>>> [MAIN  ] Completed service synchronization, ready to provide service.
>>>
>>> Does this seem familiar to anyone? Might it be fixed already in a newer release?
>>>
>>> Jon