failing assert(addrlen) in totemip_equal()

I have been running some tests that break the connection between two nodes by restarting Corosync, and occasionally I see the code fail the assert(addrlen) in totemip_equal(). I have hit this a couple of times now, but I'm not sure exactly how reproducible it is. This is with Corosync-2.0.1.

(gdb) bt
#0  0x00007f3681ad9da5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f3681adb2c3 in abort () at abort.c:88
#2  0x00007f3681ad2d99 in __assert_fail (assertion=0x7f36838c7278 "addrlen", file=0x7f36838c726e "totemip.c", line=106,
    function=0x7f36838c7290 "totemip_equal") at assert.c:78
#3  0x00007f36838b5229 in totemip_equal (addr1=<value optimized out>, addr2=<value optimized out>) at totemip.c:106
#4  0x00007f36838c2e35 in srp_addr_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1114
#5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
#6  memb_join_process (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:4047
#7  0x00007f36838c35bc in message_handler_memb_join (instance=0x7f3683ea8010, msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:4304
#8  0x00007f36838b9748 in rrp_deliver_fn (context=0x7028c0, msg=0x743e78, msg_len=159) at totemrrp.c:1792
#9  0x00007f36838b6c61 in net_deliver_fn (fd=<value optimized out>, revents=<value optimized out>, data=<value optimized out>) at totemudp.c:465
#10 0x00007f3683457c1f in ?? ()

Dumping out the lists being compared shows that they both have two entries. The first entry in both cases is the local node. The second entry is zeroed.

(gdb) p instance->my_proc_list->addr[0]
$16 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
(gdb) p instance->my_proc_list->addr[1]
$17 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
(gdb) p instance->my_proc_list_entries
$18 = 2

(gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[0]
$23 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
(gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[1]
$24 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
(gdb) p memb_join->proc_list_entries
$10 = 2
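
For illustration, here is a minimal sketch of why a zeroed entry would trip the assert. It assumes that totemip_equal() derives the comparison length from the address family and asserts that the length is non-zero. The struct and the sketch_* names below are hypothetical stand-ins modelled on the gdb output above, not the actual Corosync definitions, and the sketch returns -1 at the point where the real code would abort.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the entries printed by gdb above
 * (nodeid, family, 16-byte addr). Not copied from the Corosync headers. */
struct sketch_ip_address {
    unsigned int nodeid;
    unsigned short family;
    unsigned char addr[16];
};

/* Assumption: the comparison length is chosen from the address family.
 * An unrecognised family (such as 0) leaves the length at 0. */
static size_t sketch_addrlen(unsigned short family)
{
    if (family == AF_INET)
        return sizeof(struct in_addr);
    if (family == AF_INET6)
        return sizeof(struct in6_addr);
    return 0;
}

/* Returns 1 if the addresses match, 0 if they differ, and -1 at the
 * point where an assert(addrlen) would abort the process. */
static int sketch_ip_equal(const struct sketch_ip_address *a,
                           const struct sketch_ip_address *b)
{
    size_t addrlen;

    if (a->family != b->family)
        return 0;

    addrlen = sketch_addrlen(a->family);
    if (addrlen == 0)
        return -1; /* assert(addrlen) would fire here */

    return memcmp(a->addr, b->addr, addrlen) == 0;
}
```

Under these assumptions, comparing the two first entries (family 2, 169.254.0.1) succeeds, while comparing the two zeroed second entries (family 0 on both sides) passes the family check and then reaches the assert with a zero length, which would match the backtrace.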

The log messages show the local node forming and breaking associations with the peer node, but nothing that looks obviously wrong:

[QUORUM] This node is within the non-primary component and will NOT provide any services.
[QUORUM] Members[1]: 1
[QUORUM] Members[1]: 1
[TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:316) was formed.
[MAIN  ] Completed service synchronization, ready to provide service.
[QUORUM] Members[2]: 1 3
[TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:324) was formed.
[QUORUM] This node is within the primary component and will provide service.
[QUORUM] Members[2]: 1 3
[MAIN  ] Completed service synchronization, ready to provide service.
[QUORUM] This node is within the non-primary component and will NOT provide any services.
[QUORUM] Members[1]: 1
[QUORUM] Members[1]: 1
[TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:328) was formed.
[MAIN  ] Completed service synchronization, ready to provide service.

Does this seem familiar to anyone? Might it be fixed already in a newer release?

	Jon



_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

