Hi,

it might be a NetBSD-related bug, or something is interfering with its
pthread implementation. I found a 2.x release not compiling on NetBSD
some time ago, but you are right - the 2.1.0 release can be built
successfully.

There is another issue - this happens when I start it:

Dec 10 11:21:16 [21835] ctx4980gate2 corosync notice [TOTEM ] Initializing transport (UDP/IP Multicast).
Dec 10 11:21:16 [21835] ctx4980gate2 corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Dec 10 11:21:16 [21835] ctx4980gate2 corosync error [QB ] kevent(poll): Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync warning [QB ] fd->poll: Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync error [QB ] kevent(poll): Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync warning [QB ] fd->poll: Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync error [QB ] kevent(poll): Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync warning [QB ] fd->poll: Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync error [QB ] kevent(poll): Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync warning [QB ] fd->poll: Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync error [QB ] kevent(poll): Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync warning [QB ] fd->poll: Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync error [QB ] kevent(poll): Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync warning [QB ] fd->poll: Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync error [QB ] kevent(poll): Bad file descriptor (9)

Ideas?

Regards,

Stephan

2012/12/10 Jan Friesse <jfriesse@xxxxxxxxxx>:
> Stephan,
> do you think that it can be a problem in the NetBSD thread code itself?
> Because if so, I cannot do too much about that (other than advise you to
> try corosync 2.1.x + pacemaker 1.1; this is no longer based on plugins
> (it uses cpg directly), so it should not fail, and 2.1 was tested on NetBSD
> to at least compile and basically work). If you believe that this is a problem
> in corosync, can you please try to run some kind of tool (I don't know
> if valgrind is available) to give me a hint about what is happening (for
> example, memory being overwritten, ...).
>
> Regards,
> Honza
>
> Stephan wrote:
>> Hi Jan,
>>
>> this happens both when using "corosync-cfgtool -l" and with a file in service.d.
>> It seems that something hoses the thread's internal data (TLS).
>> According to gdb, the pointer (&conn_info->mutex) passed to
>> pthread_mutex_lock() (via %rdi) is correct. I added a syslog()
>> statement before the call to pthread_mutex_lock() and found the
>> program crashing in it. This happens because of libc's internal
>> synchronization for threaded programs, which also calls
>> pthread_mutex_lock().
>>
>> The crash happens here:
>>
>> (gdb) frame 0
>> #0 0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>> (gdb) x/5i pthread_mutex_lock
>>    0x7f7ff68078e0 <pthread_mutex_lock>: mov %fs:0x0,%rax
>> => 0x7f7ff68078e9 <pthread_mutex_lock+9>: mov 0x10(%rax),%rdx
>>    0x7f7ff68078ed <pthread_mutex_lock+13>: xor %eax,%eax
>>    0x7f7ff68078ef <pthread_mutex_lock+15>: lock cmpxchg %rdx,0x10(%rdi)
>>    0x7f7ff68078f5 <pthread_mutex_lock+21>: test %rax,%rax
>> (gdb) info reg fs rax rdi
>> fs 0x0 0
>> rax 0x7f7ffffffffe 140187732541438
>> rdi 0x7f7ff738f050 140187585278032
>> (gdb) frame 1
>> #1 0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff738f000) at coroipcs.c:465
>> 465 pthread_mutex_lock (&conn_info->mutex);
>> (gdb) p &conn_info->mutex
>> $2 = (pthread_mutex_t *) 0x7f7ff738f050
>>
>> Probably not easy to fix...
>>
>> Regards,
>>
>> Stephan
>>
>> 2012/12/10 Jan Friesse <jfriesse@xxxxxxxxxx>:
>>> Stephan,
>>> is this happening only with pacemaker, or is this a general problem (with
>>> dynamic loading of plugins)? Can you test loading a different plugin
>>> at runtime (like one of the openais ones), or try configuring pacemaker
>>> to be loaded at start:
>>>
>>> service {
>>>     name: pacemaker
>>>     ver: 0
>>> }
>>>
>>> Regards,
>>> Honza
>>>
>>> Stephan wrote:
>>>> Hi all,
>>>>
>>>> now that Corosync 1.x (1.4.4 in this case) works on NetBSD (6.0 amd64)
>>>> "out of the box", I compiled Pacemaker 1.0 and 1.1 and tried to run it
>>>> on top of corosync. Unfortunately, when I load Pacemaker using
>>>> "corosync-cfgtool -l pacemaker", corosync crashes with SIGSEGV.
>>>>
>>>> I already found this with gdb:
>>>>
>>>> -----8<--------
>>>> Core was generated by `corosync'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0 0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>>>> (gdb) bt full
>>>> #0 0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>>>> No symbol table info available.
>>>> #1 0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff5308000) at coroipcs.c:465
>>>>         conn_info = 0x7f7ff5308000
>>>>         retval = 0
>>>> #2 pthread_ipc_consumer (conn=0x7f7ff5308000) at coroipcs.c:674
>>>>         conn_info = 0x7f7ff5308000
>>>>         header = <optimized out>
>>>>         coroipc_response_header = {size = 660260756, id = 5, error = 0}
>>>>         send_ok = <optimized out>
>>>>         new_message = <optimized out>
>>>>         sem_value = 0
>>>> #3 0x00007f7ff6809d75 in ?? () from /usr/lib/libpthread.so.1
>>>> No symbol table info available.
>>>> #4 0x00007f7ff60759f0 in ___lwp_park50 () from /usr/lib/libc.so.12
>>>> No symbol table info available.
>>>> Cannot access memory at address 0x7f7ff0000000
>>>> (gdb) frame 1
>>>> #1 0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff5308000) at coroipcs.c:465
>>>> 465 pthread_mutex_lock (&conn_info->mutex);
>>>> (gdb) print &conn_info->mutex
>>>> $1 = (pthread_mutex_t *) 0x7f7ff5308050
>>>> (gdb) p *$
>>>> $2 = {ptm_magic = 858980355, ptm_errorcheck = 0 '\000', ptm_pad1 =
>>>> "\000\000", ptm_interlock = 0 '\000', ptm_pad2 = "\000\000", ptm_owner
>>>> = 0x0, ptm_waiters = 0x0, ptm_recursed = 0, ptm_spare2 = 0x0}
>>>> (gdb) frame 0
>>>> #0 0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>>>> (gdb) x/2i 0x00007f7ff68078e0
>>>>    0x7f7ff68078e0 <pthread_mutex_lock>: mov %fs:0x0,%rax
>>>> => 0x7f7ff68078e9 <pthread_mutex_lock+9>: mov 0x10(%rax),%rdx
>>>> (gdb) info reg rax rdx
>>>> rax 0x7f7ffffffffe 140187732541438
>>>> rdx 0x0 0
>>>> (gdb) x/p 0x7f7ffffffffe
>>>> 0x7f7ffffffffe: Cannot access memory at address 0x7f7ffffffffe
>>>> ----------
>>>>
>>>> - I think gdb tells us that there is a valid struct pthread_mutex_t in memory.
>>>> - I think that 4 bytes are copied to the address rax points to. In this
>>>>   case rax points to the last page in the stack segment, crossing the
>>>>   border to the next page, which is not mapped:
>>>>
>>>> 00007f7ffffe0000-00007f7fffffffff 128k 0000000000000000 rw-p- (rwx) 1/0/0 00:00 0 - [ stack ]
>>>>
>>>> Any idea about this?
>>>>
>>>> Regards,
>>>>
>>>> Stephan
>>>> _______________________________________________
>>>> discuss mailing list
>>>> discuss@xxxxxxxxxxxx
>>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>
>
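
The address arithmetic behind the faulting instruction in the gdb output above can be checked with a short script. This is only a minimal sketch: the constants are copied from the quoted gdb session and stack mapping, the helper names are hypothetical, and the 8-byte load size is an assumption based on the 64-bit destination register %rdx.

```python
# Values copied from the gdb session and the stack mapping quoted above.
RAX = 0x7F7FFFFFFFFE             # value loaded by `mov %fs:0x0,%rax`
DISPLACEMENT = 0x10              # from `mov 0x10(%rax),%rdx`
LOAD_SIZE = 8                    # assumption: 64-bit load into %rdx
STACK_START = 0x00007F7FFFFE0000
STACK_END = 0x00007F7FFFFFFFFF   # last mapped byte (inclusive)


def effective_address(base: int, displacement: int) -> int:
    """Address accessed by a `disp(%base)` operand, wrapped to 64 bits."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


def is_mapped(addr: int, size: int, start: int, end: int) -> bool:
    """True if every byte of [addr, addr + size) lies within [start, end]."""
    return start <= addr and addr + size - 1 <= end


if __name__ == "__main__":
    ea = effective_address(RAX, DISPLACEMENT)
    print(f"faulting load address: {ea:#x}")
    print(f"stack mapping size:    {(STACK_END - STACK_START + 1) // 1024}k")
    print(f"load inside stack:     {is_mapped(ea, LOAD_SIZE, STACK_START, STACK_END)}")
    print(f"rax itself in stack:   {is_mapped(RAX, 1, STACK_START, STACK_END)}")
```

Under those assumptions the faulting load targets 0x7f800000000e, which is past the end of the 128k stack mapping, while the address held in rax itself is still inside it. The script only checks the arithmetic; it does not explain why %fs:0x0 yields that value.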