Re: Corosync/Pacemaker on NetBSD

Stephan <stephanwib@xxxxxxxxxxxxxx> · Mon, 10 Dec 2012 11:28:58 +0100

Hi Jan,

This happens both with "corosync-cfgtool -l" and with a file in service.d.
It seems that something hoses the thread's internal data (TLS).
According to gdb, the pointer (&conn_info->mutex) passed to
pthread_mutex_lock() (via %rdi) is correct. I added a syslog()
statement before the call to pthread_mutex_lock() and found that the
program then crashes inside syslog() instead. This happens because of
libc's internal synchronization for threaded programs, which also calls
pthread_mutex_lock().
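
One way to narrow this down might be to check whether a plain threaded program, outside corosync, can lock a mutex from a newly created thread at all. The following is only a minimal sketch: struct fake_conn, worker() and run_worker_lock_test() are hypothetical names standing in for conn_info and the IPC consumer thread, not code from coroipcs.c, and it assumes the program is linked against the platform's pthread library.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for conn_info: only the mutex matters here. */
struct fake_conn {
    pthread_mutex_t mutex;
    int active;
};

/* Worker thread: takes and releases the mutex, much like the
 * pthread_mutex_lock() call at coroipcs.c:465 does. */
static void *worker(void *arg)
{
    struct fake_conn *conn = arg;

    if (pthread_mutex_lock(&conn->mutex) != 0)
        return NULL;
    conn->active = 1;
    pthread_mutex_unlock(&conn->mutex);
    return conn;
}

/* Returns 0 if a freshly created thread could lock the mutex, -1 otherwise.
 * A broken thread pointer would more likely show up as a SIGSEGV than as -1. */
int run_worker_lock_test(void)
{
    struct fake_conn conn = { .active = 0 };
    pthread_t tid;
    void *result = NULL;

    if (pthread_mutex_init(&conn.mutex, NULL) != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker, &conn) != 0) {
        pthread_mutex_destroy(&conn.mutex);
        return -1;
    }
    pthread_join(tid, &result);
    pthread_mutex_destroy(&conn.mutex);
    return (result == &conn && conn.active == 1) ? 0 : -1;
}
```

If this passes when called from a small test program but corosync still crashes after loading the plugin, that would suggest the thread data is damaged by something specific to corosync or the plugin load rather than by libpthread in general.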

The crash happens here:

(gdb) frame 0
#0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
(gdb) x/5i pthread_mutex_lock
   0x7f7ff68078e0 <pthread_mutex_lock>: mov    %fs:0x0,%rax
=> 0x7f7ff68078e9 <pthread_mutex_lock+9>:       mov    0x10(%rax),%rdx
   0x7f7ff68078ed <pthread_mutex_lock+13>:      xor    %eax,%eax
   0x7f7ff68078ef <pthread_mutex_lock+15>:      lock cmpxchg %rdx,0x10(%rdi)
   0x7f7ff68078f5 <pthread_mutex_lock+21>:      test   %rax,%rax
(gdb) info reg fs rax rdi
fs             0x0      0
rax            0x7f7ffffffffe   140187732541438
rdi            0x7f7ff738f050   140187585278032
(gdb) frame 1
#1  0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff738f000) at coroipcs.c:465
465             pthread_mutex_lock (&conn_info->mutex);
(gdb) p &conn_info->mutex
$2 = (pthread_mutex_t *) 0x7f7ff738f050
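
To make the failing access concrete, the address arithmetic can be written out. This sketch only uses numbers from this thread: the %rax value and the 0x10 displacement come from the gdb output above, and the stack range comes from the mapping quoted in my first mail below. The load is taken to be 8 bytes wide because its destination is the 64-bit register %rdx. The function and macro names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the gdb session and the stack mapping in this thread. */
#define RAX_VALUE   UINT64_C(0x7f7ffffffffe) /* %rax after mov %fs:0x0,%rax */
#define LOAD_OFFSET UINT64_C(0x10)           /* displacement in mov 0x10(%rax),%rdx */
#define LOAD_SIZE   UINT64_C(8)              /* 64-bit load into %rdx */
#define STACK_FIRST UINT64_C(0x7f7ffffe0000) /* first byte of the stack mapping */
#define STACK_LAST  UINT64_C(0x7f7fffffffff) /* last byte of the stack mapping */

/* True if the bytes [addr, addr + size) all lie within [first, last]. */
bool access_in_range(uint64_t addr, uint64_t size, uint64_t first, uint64_t last)
{
    assert(first <= last && size > 0);
    return addr >= first && addr + size - 1 <= last;
}

/* Address actually dereferenced by the faulting instruction. */
uint64_t faulting_load_address(void)
{
    return RAX_VALUE + LOAD_OFFSET; /* 0x7f800000000e */
}
```

With these numbers the load address is 0x7f800000000e, which lies just past the end of the stack mapping, so the fault itself is expected. What it does not explain is why %fs:0x0 yields 0x7f7ffffffffe in the first place.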

Probably not easy to fix...

Regards,

Stephan

2012/12/10 Jan Friesse <jfriesse@xxxxxxxxxx>:
> Stephan,
> is this happening only with pacemaker, or is it a general problem (with
> dynamic loading of plugins)? Can you test loading a different plugin
> at runtime (like one of the openais ones), or try to configure
> pacemaker to be loaded after start:
>
> service {
> name: pacemaker
> ver: 0
> }
>
> Regards,
>   Honza
>
> Stephan wrote:
>> Hi all,
>>
>> now that Corosync 1.x (1.4.4 in this case) works on NetBSD (6.0 amd64)
>> "out of the box", I compiled Pacemaker 1.0 and 1.1 and tried to run it
>> on top of corosync. Unfortunately, when I load Pacemaker using
>> "corosync-cfgtool -l pacemaker", corosync crashes with SIGSEGV.
>>
>> I already found this with gdb:
>>
>> -----8<--------
>> Core was generated by `corosync'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>> (gdb) bt full
>> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>> No symbol table info available.
>> #1  0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff5308000) at coroipcs.c:465
>>         conn_info = 0x7f7ff5308000
>>         retval = 0
>> #2  pthread_ipc_consumer (conn=0x7f7ff5308000) at coroipcs.c:674
>>         conn_info = 0x7f7ff5308000
>>         header = <optimized out>
>>         coroipc_response_header = {size = 660260756, id = 5, error = 0}
>>         send_ok = <optimized out>
>>         new_message = <optimized out>
>>         sem_value = 0
>> #3  0x00007f7ff6809d75 in ?? () from /usr/lib/libpthread.so.1
>> No symbol table info available.
>> #4  0x00007f7ff60759f0 in ___lwp_park50 () from /usr/lib/libc.so.12
>> No symbol table info available.
>> Cannot access memory at address 0x7f7ff0000000
>> (gdb) frame 1
>> #1  0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff5308000) at coroipcs.c:465
>> 465             pthread_mutex_lock (&conn_info->mutex);
>> (gdb) print &conn_info->mutex
>> $1 = (pthread_mutex_t *) 0x7f7ff5308050
>> (gdb) p *$
>> $2 = {ptm_magic = 858980355, ptm_errorcheck = 0 '\000', ptm_pad1 = "\000\000",
>>   ptm_interlock = 0 '\000', ptm_pad2 = "\000\000", ptm_owner = 0x0,
>>   ptm_waiters = 0x0, ptm_recursed = 0, ptm_spare2 = 0x0}
>> (gdb) frame 0
>> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>> (gdb) x/2i 0x00007f7ff68078e0
>>    0x7f7ff68078e0 <pthread_mutex_lock>: mov    %fs:0x0,%rax
>> => 0x7f7ff68078e9 <pthread_mutex_lock+9>:       mov    0x10(%rax),%rdx
>> (gdb) info reg rax rdx
>> rax            0x7f7ffffffffe   140187732541438
>> rdx            0x0      0
>> (gdb) x/p 0x7f7ffffffffe
>> 0x7f7ffffffffe: Cannot access memory at address 0x7f7ffffffffe
>> ----------
>>
>> - I think gdb tells us that there is a valid struct pthread_mutex_t in memory.
>> - I think that 4 bytes are copied to the address rax points to. In this
>> case rax points into the last page of the stack segment, and the access
>> crosses the border into the next page, which is not mapped:
>>
>> 00007f7ffffe0000-00007f7fffffffff     128k 0000000000000000 rw-p- (rwx) 1/0/0 00:00       0 -   [ stack ]
>>
>> Any idea about this?
>>
>> Regards,
>>
>> Stephan
>> _______________________________________________
>> discuss mailing list
>> discuss@xxxxxxxxxxxx
>> http://lists.corosync.org/mailman/listinfo/discuss
>
