After trying a few more combinations, I think the trigger for the problem is encrypted traffic with mismatched authkeys. If two machines are running in this state then one typically hits this assert within a few minutes. Restarting the processes and making/breaking the membership by swapping the keys doesn't seem to be necessary.

	Jon

-----Original Message-----
From: Burgess, Jon
Sent: 14 December 2012 15:13
To: 'Jan Friesse'
Cc: discuss@xxxxxxxxxxxx
Subject: RE: failing assert(addrlen) in totemip_equal()

> Jon,
> this is a new bug. Can you please send me the config and describe how
> exactly you are restarting corosync? (script, kill -INT pid, ...)

I can reproduce the crash with the script below in about 5 minutes on a pair of F17 VMs running corosync-2.0.2-1.fc17.x86_64.

The Corosync configuration has encryption enabled (aes256+sha1) and the script alternates between two different keys, one which matches the peer and another which doesn't:

#!/bin/bash
while (true)
do
    echo "start good"
    cp /etc/corosync/authkey.good /etc/corosync/authkey
    corosync
    usleep $(( $RANDOM * 100 ))
    echo "stop"
    killall -9 corosync
    usleep $(( $RANDOM * 100 ))
    echo "start bad"
    cp /etc/corosync/authkey.bad /etc/corosync/authkey
    corosync
    usleep $(( $RANDOM * 100 ))
    echo "stop"
    killall -9 corosync
    usleep $(( $RANDOM * 100 ))
done

After this has been running for a few minutes the peer crashes:

Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408 1719314624
Dec 14 13:54:48 notice  [TOTEM ] totemsrp.c:1909 A processor joined or left the membership and a new membership (192.168.122.101:2824) was formed.
Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:148 This node is within the primary component and will provide service.
Dec 14 13:54:48 notice  [QUORUM] vsf_quorum.c:132 Members[2]: 1702537408 1719314624
Dec 14 13:54:48 notice  [MAIN  ] main.c:252 Completed service synchronization, ready to provide service.
Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
Dec 14 13:54:50 notice  [TOTEM ] totemsrp.c:1610 A processor failed, forming new configuration.
Dec 14 13:54:50 alert   [TOTEM ] totemcrypto.c:354 PK11_DigestFinal (decrypt) failed (err -8023)
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:444 Received message has invalid digest... ignoring.
Dec 14 13:54:50 alert   [TOTEM ] totemudp.c:446 Invalid packet data
corosync: totemip.c:106: totemip_equal: Assertion `addrlen' failed.
Ringbuffer:
 ->OVERWRITE
 ->write_pt [648645]
 ->read_pt [0]
 ->size [2097152 words]
 =>free [5794024 bytes]
 =>used [2594580 bytes]
Aborted (core dumped)

> Also if you have coredumps, it would be interesting to see not only
> srp_addr->addr (because only [0] is checked), but also
> memb_join->proc_list_entries,

(gdb) p memb_join->proc_list_entries
$10 = 2

> instance->my_proc_list_entries (should be same)

(gdb) p instance->my_proc_list_entries
$18 = 2

> and full content of proc_list array (must have
> memb_join->proc_list_entries entries)

I assume you mean the "proc_list" local variable in memb_join_process():

(gdb) p proc_list
$9 = <value optimized out>

But...
proc_list = (struct srp_addr *)memb_join->end_of_memb_join;

The contents of proc_list should be the same as this:

(gdb) p *(struct srp_addr *)memb_join->end_of_memb_join
$22 = {no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}

> and instance->my_proc_list (should
> have instance->my_proc_list_entries entries).

(gdb) p instance->my_proc_list_entries
$11 = 2
(gdb) p instance->my_proc_list
$12 = {{no_addrs = 2 '\002', addr = {{nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}}, {no_addrs = 64 '@', addr = {{nodeid = 2171259156, family = 34025, addr = "\371[E\341t\277\006\360\031\350\256 \022\356", <incomplete sequence \343>}, {nodeid = 809278438, family = 58892, addr = "\004q3\267\030bC\243\255hm)\377\036\316\060"}}}, {no_addrs = 0 '\000', addr = {{nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}, {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}}} <repeats 382 times>}

> First entry in proc_list and instance->my_proc_list looks the same.

Something I forgot to mention last time: the assert triggers when it tries to compare the two zeroed addresses at the [1] position in each list:

(gdb) up
#5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
1291                            if (srp_addr_equal (&set1[j], &set2[i])) {
(gdb) p i
$6 = 1
(gdb) p j
$7 = 1

> Regards,
>   Honza

Burgess, Jon wrote:

> I have been doing some tests which involve breaking the connection between two nodes by restarting Corosync, and occasionally I see the code failing the assert(addrlen) in totemip_equal(). I have hit this a couple of times now but I'm not sure exactly how reproducible it is. This is with Corosync 2.0.1.
>
> (gdb) bt
> #0  0x00007f3681ad9da5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x00007f3681adb2c3 in abort () at abort.c:88
> #2  0x00007f3681ad2d99 in __assert_fail (assertion=0x7f36838c7278 "addrlen", file=0x7f36838c726e "totemip.c", line=106,
>     function=0x7f36838c7290 "totemip_equal") at assert.c:78
> #3  0x00007f36838b5229 in totemip_equal (addr1=<value optimized out>, addr2=<value optimized out>) at totemip.c:106
> #4  0x00007f36838c2e35 in srp_addr_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1114
> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
> #6  memb_join_process (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:4047
> #7  0x00007f36838c35bc in message_handler_memb_join (instance=0x7f3683ea8010, msg=<value optimized out>, msg_len=<value optimized out>,
>     endian_conversion_needed=<value optimized out>) at totemsrp.c:4304
> #8  0x00007f36838b9748 in rrp_deliver_fn (context=0x7028c0, msg=0x743e78, msg_len=159) at totemrrp.c:1792
> #9  0x00007f36838b6c61 in net_deliver_fn (fd=<value optimized out>, revents=<value optimized out>, data=<value optimized out>) at totemudp.c:465
> #10 0x00007f3683457c1f in ?? ()
>
> Dumping out the lists being compared shows that they both have two entries. The first entry in both cases is the local node. The second entry is zeroed.
>
> (gdb) p instance->my_proc_list->addr[0]
> $16 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
> (gdb) p instance->my_proc_list->addr[1]
> $17 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
> (gdb) p instance->my_proc_list_entries
> $18 = 2
>
> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[0]
> $23 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[1]
> $24 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
> (gdb) p memb_join->proc_list_entries
> $10 = 2
>
> The log messages show the local node forming and breaking associations with the peer node, but nothing looks obviously wrong:
>
> [QUORUM] This node is within the non-primary component and will NOT provide any services.
> [QUORUM] Members[1]: 1
> [QUORUM] Members[1]: 1
> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:316) was formed.
> [MAIN  ] Completed service synchronization, ready to provide service.
> [QUORUM] Members[2]: 1 3
> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:324) was formed.
> [QUORUM] This node is within the primary component and will provide service.
> [QUORUM] Members[2]: 1 3
> [MAIN  ] Completed service synchronization, ready to provide service.
> [QUORUM] This node is within the non-primary component and will NOT provide any services.
> [QUORUM] Members[1]: 1
> [QUORUM] Members[1]: 1
> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:328) was formed.
> [MAIN  ] Completed service synchronization, ready to provide service.
>
> Does this seem familiar to anyone? Might it be fixed already in a newer release?
>
> Jon
>
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss