Re: failing assert(addrlen) in totemip_equal()

Jon,
this is a new bug. Can you please send me your config and describe
exactly how you are restarting corosync? (script, kill -INT pid, ...)

Also, if you have coredumps, it would be interesting to see not only
srp_addr->addr (because only [0] is checked), but also
memb_join->proc_list_entries and instance->my_proc_list_entries (they
should be the same), the full content of the proc_list array (it must
have memb_join->proc_list_entries entries), and instance->my_proc_list
(it should have instance->my_proc_list_entries entries).
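
Something like this should dump the full arrays from the coredump
(untested; the frame number and field names are taken from your
backtrace, so adjust as needed):

(gdb) frame 4
(gdb) p memb_join->proc_list_entries
(gdb) p instance->my_proc_list_entries
(gdb) p *instance->my_proc_list@instance->my_proc_list_entries
(gdb) p *((struct srp_addr *)memb_join->end_of_memb_join)@memb_join->proc_list_entries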

The first entries of proc_list and instance->my_proc_list look the same.
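
A zeroed address (family == 0) is exactly what trips that assert:
totemip_equal() derives the address length from the family field, so
family == 0 leaves addrlen at 0. Roughly like this (a sketch
reconstructed from your backtrace and gdb dumps, not the exact
totemip.c source):

#include <assert.h>
#include <string.h>
#include <netinet/in.h>

/* field layout matches the gdb dumps; the real definition lives in
 * totemip.h */
struct totem_ip_address {
        unsigned int   nodeid;
        unsigned short family;
        unsigned char  addr[16];
};

static int totemip_equal(const struct totem_ip_address *addr1,
                         const struct totem_ip_address *addr2)
{
        int addrlen = 0;

        if (addr1->family != addr2->family)
                return 0;
        if (addr1->family == AF_INET)
                addrlen = sizeof(struct in_addr);   /* 4 */
        if (addr1->family == AF_INET6)
                addrlen = sizeof(struct in6_addr);  /* 16 */
        assert(addrlen); /* fires when family == 0 (zeroed entry) */

        return memcmp(addr1->addr, addr2->addr, addrlen) == 0;
}

So the interesting question is how a zeroed entry ends up being
compared at all.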

Regards,
  Honza

Burgess, Jon wrote:
> I have been doing some tests which involve breaking the connection between two nodes by restarting Corosync, and occasionally I see the code failing the assert(addrlen) in totemip_equal(). I have hit this a couple of times now, but I'm not sure exactly how reproducible it is. This is with Corosync 2.0.1.
> 
> (gdb) bt
> #0  0x00007f3681ad9da5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x00007f3681adb2c3 in abort () at abort.c:88
> #2  0x00007f3681ad2d99 in __assert_fail (assertion=0x7f36838c7278 "addrlen", file=0x7f36838c726e "totemip.c", line=106,
>     function=0x7f36838c7290 "totemip_equal") at assert.c:78
> #3  0x00007f36838b5229 in totemip_equal (addr1=<value optimized out>, addr2=<value optimized out>) at totemip.c:106
> #4  0x00007f36838c2e35 in srp_addr_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1114
> #5  memb_set_equal (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:1291
> #6  memb_join_process (instance=0x7f3683ea8010, memb_join=0x743e78) at totemsrp.c:4047
> #7  0x00007f36838c35bc in message_handler_memb_join (instance=0x7f3683ea8010, msg=<value optimized out>, msg_len=<value optimized out>,
>     endian_conversion_needed=<value optimized out>) at totemsrp.c:4304
> #8  0x00007f36838b9748 in rrp_deliver_fn (context=0x7028c0, msg=0x743e78, msg_len=159) at totemrrp.c:1792
> #9  0x00007f36838b6c61 in net_deliver_fn (fd=<value optimized out>, revents=<value optimized out>, data=<value optimized out>) at totemudp.c:465
> #10 0x00007f3683457c1f in ?? ()
> 
> Dumping out the lists being compared shows that they both have two entries. The first entry in both cases is the local node. The second entry is zeroed.
> 
> (gdb) p instance->my_proc_list->addr[0]
> $16 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
> (gdb) p instance->my_proc_list->addr[1]
> $17 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
> (gdb) p instance->my_proc_list_entries
> $18 = 2
> 
> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[0]
> $23 = {nodeid = 1, family = 2, addr = "\251\376\000\001", '\000' <repeats 11 times>}
> (gdb) p ((struct srp_addr *)memb_join->end_of_memb_join)->addr[1]
> $24 = {nodeid = 0, family = 0, addr = '\000' <repeats 15 times>}
> (gdb) p memb_join->proc_list_entries
> $10 = 2
> 
> The log messages show the local node forming and breaking associations with the peer node, but nothing looks obviously wrong:
> 
> [QUORUM] This node is within the non-primary component and will NOT provide any services.
> [QUORUM] Members[1]: 1
> [QUORUM] Members[1]: 1
> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:316) was formed.
> [MAIN  ] Completed service synchronization, ready to provide service.
> [QUORUM] Members[2]: 1 3
> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:324) was formed.
> [QUORUM] This node is within the primary component and will provide service.
> [QUORUM] Members[2]: 1 3
> [MAIN  ] Completed service synchronization, ready to provide service.
> [QUORUM] This node is within the non-primary component and will NOT provide any services.
> [QUORUM] Members[1]: 1
> [QUORUM] Members[1]: 1
> [TOTEM ] A processor joined or left the membership and a new membership (169.254.0.1:328) was formed.
> [MAIN  ] Completed service synchronization, ready to provide service.
> 
> Does this seem familiar to anyone? Might it be fixed already in a newer release?
> 
> 	Jon
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

