After trying a few more combinations, I think the trigger for the problem is encrypted traffic with mismatched authkeys. If two machines are running in this state then one typically hits this assert within a few minutes. Restarting the processes and making/breaking the membership by swapping the keys doesn't seem to be necessary.

	Jon

-----Original Message-----
From: Burgess, Jon
Sent: 14 December 2012 15:13
To: 'Jan Friesse'
Cc: discuss@xxxxxxxxxxxx
Subject: RE: failing assert(addrlen) in totemip_equal()

> Jon,
> this is a new bug. Can you please send me the config and describe how
> exactly you are restarting corosync? (script, kill -INT pid, ...)

I can reproduce the crash with the script below in about 5 minutes on a pair of F17 VMs running corosync-2.0.2-1.fc17.x86_64.

The Corosync configuration has encryption enabled (aes256+sha1) and the script alternates between two different keys, one which matches the peer and another which doesn't:

#!/bin/bash
while (true)
do
    echo "start good"
    cp /etc/corosync/authkey.good /etc/corosync/authkey
    corosync
    usleep $(( $RANDOM * 100 ))
    echo "stop"
    killall -9 corosync
    usleep $(( $RANDOM * 100 ))
    echo "start bad"
    cp /etc/corosync/authkey.bad /etc/corosync/authkey
    corosync
    usleep $(( $RANDOM * 100 ))
    echo "stop"
    killall -9 corosync
    usleep $(( $RANDOM * 100 ))
done

After this has been running for a few minutes the peer crashes:

Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408 1719314624
Dec 14 13:54:48 notice  [TOTEM ] totemsrp.c:1909 A processor joined or left the membership and a new membership (192.168.122.101:2824) was formed.
Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:148 This node is within the primary component and will provide service.
Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408 1719314624
Dec 14 13:54:48 notice  [MAIN  ] main.c:252 Completed service synchronization, ready to provide service.
Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
Dec 14 13:54:50 notice  [TOTEM ] totemsrp.c:1610 A processor failed, forming new configuration.
Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
corosync: totemip.c:106: totemip_equal: Assertion `addrlen' failed.
Ringbuffer:
 ->OVERWRITE
 ->write_pt [648645]
 ->read_pt [0]
 ->size [2097152 words]
 =>free [5794024 bytes]
 =>used [2594580 bytes]
Aborted (core dumped)

> Also if you have coredumps, it would be interesting to see not only
> srp_addr->addr (because only [0] is checked), but also
> memb_join->proc_list_entries,

(gdb) p memb_join->proc_list_entries
$10 = 2

> instance->my_proc_list_entries (should be same)

(gdb) p instance->my_proc_list_entries
$18 = 2

> and full content of proc_list array (must have
> memb_join->proc_list_entries entries)

I assume you mean the "proc_list" local variable in memb_join_process():

(gdb) p proc_list
$9 = <value optimized out>

But...
proc_list = (struct srp_addr *)memb_join->end_of_memb_join;

The contents of proc_list should be the same as this:

(gdb) p *(struct srp_addr *)memb_join->end_of_memb_join
$22 = {no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}

> and instance->my_proc_list (should
> have instance->my_proc_list_entries entries).

(gdb) p instance->my_proc_list_entries
$11 = 2
(gdb) p instance->my_proc_list
$12 = {{no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}, {no_addrs = 64 '@', addr = {{nodeid = 2171259156, family = 34025, addr = "\371[E\341t\277\006\360\031\350\256 \022\356", <incomplete sequence \343>}, {nodeid = 809278438, family = 58892, addr = "\004q3\267\030bC\243\255hm)\377\036\316\060"}}}, {no_addrs = 0 '\000', addr = {{nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}} <repeats 382 times>}

> First entry in proc_list and instance->my_proc_list looks the same.

Something I forgot to mention last time: the assert triggers when it tries to compare the two zeroed addresses at the [1] position in each list:

(gdb) up
#5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
1291                            if (srp_addr_equal (&set1[j], &set2[i])) {
(gdb) p i
$6 = 1
(gdb) p j
$7 = 1

> Regards,
>   Honza

Burgess, Jon wrote:

> I have been doing some tests which involve breaking the connection between two nodes by restarting Corosync, and occasionally I see the code failing the assert(addrlen) in totemip_equal(). I have hit this a couple of times now but I'm not sure exactly how reproducible it is. This is with Corosync 2.0.1.
>
> (gdb) bt
> #0  0x00007f3681ad9da5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x00007f3681adb2c3 in abort () at abort.c:88
> #2  0x00007f3681ad2d99 in __assert_fail (assertion=0x7f36838c7278 "addrlen", file=0x7f36838c726e "totemip.c", line=106,
>     function=0x7f36838c7290 "totemip_equal") at assert.c:78
> #3  0x00007f36838b5229 in totemip_equal (addr1=<value optimized out>, addr2=<value optimized out>) at totemip.c:106
> #4  0x00007f36838c2e35 in srp_addr_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1114
> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
> #6  memb_join_process (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:4047
> #7  0x00007f36838c35bc in message_handler_memb_join (instance=0x7f3683ea8010, msg=<value optimized out>, msg_len=<value optimized out>,
>     endian_conversion_needed=<value optimized out>) at totemsrp.c:4304
> #8  0x00007f36838b9748 in rrp_deliver_fn (context=0x7028c0, msg=0x743e78, msg_len=159) at totemrrp.c:1792
> #9  0x00007f36838b6c61 in net_deliver_fn (fd=<value optimized out>, revents=<value optimized out>, data=<value optimized out>) at totemudp.c:465
> #10 0x00007f3683457c1f in ?? ()
>
> Dumping out the lists being compared shows that they both have two entries. The first entry in both cases is the local node. The second entry is zeroed.
>
> (gdb) p instance->my_proc_list->addr[0]
> $16 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
> (gdb) p instance->my_proc_list->addr[1]
> $17 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
> (gdb) p instance->my_proc_list_entries
> $18 = 2
>
> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[0]
> $23 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[1]
> $24 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
> (gdb) p memb_join->proc_list_entries
> $10 = 2
>
> The log messages show the local node forming and breaking associations with the peer node, but nothing looks obviously wrong:
>
> [QUORUM] This node is within the non-primary component and will NOT provide any services.
> [QUORUM] Members[1]: 1
> [QUORUM] Members[1]: 1
> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:316) was formed.
> [MAIN  ] Completed service synchronization, ready to provide service.
> [QUORUM] Members[2]: 1 3
> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:324) was formed.
> [QUORUM] This node is within the primary component and will provide service.
> [QUORUM] Members[2]: 1 3
> [MAIN  ] Completed service synchronization, ready to provide service.
> [QUORUM] This node is within the non-primary component and will NOT provide any services.
> [QUORUM] Members[1]: 1
> [QUORUM] Members[1]: 1
> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:328) was formed.
> [MAIN  ] Completed service synchronization, ready to provide service.
>
> Does this seem familiar to anyone? Might it be fixed already in a newer release?
>
> Jon
>
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss