Re: Corosync/Pacemaker on NetBSD

Stephan,
do you think the problem may be in the NetBSD thread code itself?
Because if so, I cannot do much about it (other than advise you to try
corosync 2.1.x + pacemaker 1.1; that combination is no longer based on
plugins (it uses cpg directly), so it should not fail, and 2.1 was
tested on NetBSD at least to compile and do basic work). If you believe
the problem is in corosync, can you please run some kind of tool
(I don't know whether valgrind is available) to give me a hint about
what is happening (e.g. a memory overwrite, ...)?

Regards,
  Honza

Stephan napsal(a):
> Hi Jan,
> 
> this happens both when loading the service with "corosync-cfgtool -l"
> and when configuring it via a file in service.d.
> It seems that something hoses the thread's internal data (TLS).
> According to gdb, the pointer (&conn_info->mutex) passed to
> pthread_mutex_lock() (via %rdi) is correct. I added a syslog()
> statement before the call to pthread_mutex_lock() and found the
> program crashing inside syslog() instead; that happens because libc's
> internal synchronization for threaded programs also calls
> pthread_mutex_lock().
> 
> The crash happens here:
> 
> (gdb) frame 0
> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
> (gdb) x/5i pthread_mutex_lock
>    0x7f7ff68078e0 <pthread_mutex_lock>: mov    %fs:0x0,%rax
> => 0x7f7ff68078e9 <pthread_mutex_lock+9>:       mov    0x10(%rax),%rdx
>    0x7f7ff68078ed <pthread_mutex_lock+13>:      xor    %eax,%eax
>    0x7f7ff68078ef <pthread_mutex_lock+15>:      lock cmpxchg %rdx,0x10(%rdi)
>    0x7f7ff68078f5 <pthread_mutex_lock+21>:      test   %rax,%rax
> (gdb) info reg fs rax rdi
> fs             0x0      0
> rax            0x7f7ffffffffe   140187732541438
> rdi            0x7f7ff738f050   140187585278032
> (gdb) frame 1
> #1  0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff738f000) at
> coroipcs.c:465
> 465             pthread_mutex_lock (&conn_info->mutex);
> (gdb) p &conn_info->mutex
> $2 = (pthread_mutex_t *) 0x7f7ff738f050
> 
> 
> 
> Probably not easy to fix...
> 
> Regards,
> 
> Stephan
> 
> 2012/12/10 Jan Friesse <jfriesse@xxxxxxxxxx>:
>> Stephan,
>> is this happening only with pacemaker, or is this general problem (with
>> dynamically loading of plugins)? Can you test to load different plugin
>> in runtime (like one of openais one) or try to configure to load
>> pacemaker after start:
>>
>> service {
>> name: pacemaker
>> ver: 0
>> }
>>
>> Regards,
>>   Honza
>>
>> Stephan napsal(a):
>>> Hi all,
>>>
>>> now that Corosync 1.x (1.4.4 in this case) works on NetBSD (6.0 amd64)
>>> "out of the box", I compiled Pacemaker 1.0 and 1.1 and tried to run it
>>> on top of corosync. Unfortunately, when I load Pacemaker using
>>> "corosync-cfgtool -l pacemaker", corosync crashes with SIGSEGV.
>>>
>>> I already found this with gdb:
>>>
>>> -----8<--------
>>> Core was generated by `corosync'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>>> (gdb) bt full
>>> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>>> No symbol table info available.
>>> #1  0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff5308000) at
>>> coroipcs.c:465
>>>         conn_info = 0x7f7ff5308000
>>>         retval = 0
>>> #2  pthread_ipc_consumer (conn=0x7f7ff5308000) at coroipcs.c:674
>>>         conn_info = 0x7f7ff5308000
>>>         header = <optimized out>
>>>         coroipc_response_header = {size = 660260756, id = 5, error = 0}
>>>         send_ok = <optimized out>
>>>         new_message = <optimized out>
>>>         sem_value = 0
>>> #3  0x00007f7ff6809d75 in ?? () from /usr/lib/libpthread.so.1
>>> No symbol table info available.
>>> #4  0x00007f7ff60759f0 in ___lwp_park50 () from /usr/lib/libc.so.12
>>> No symbol table info available.
>>> Cannot access memory at address 0x7f7ff0000000
>>> (gdb) frame 1
>>> #1  0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff5308000) at
>>> coroipcs.c:465
>>> 465             pthread_mutex_lock (&conn_info->mutex);
>>> (gdb) print &conn_info->mutex
>>> $1 = (pthread_mutex_t *) 0x7f7ff5308050
>>> (gdb) p *$
>>> $2 = {ptm_magic = 858980355, ptm_errorcheck = 0 '\000', ptm_pad1 =
>>> "\000\000", ptm_interlock = 0 '\000', ptm_pad2 = "\000\000", ptm_owner
>>> = 0x0, ptm_waiters = 0x0, ptm_recursed = 0, ptm_spare2 = 0x0}
>>> (gdb) frame 0
>>> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>>> (gdb) x/2i 0x00007f7ff68078e0
>>>    0x7f7ff68078e0 <pthread_mutex_lock>: mov    %fs:0x0,%rax
>>> => 0x7f7ff68078e9 <pthread_mutex_lock+9>:       mov    0x10(%rax),%rdx
>>> (gdb) info reg rax rdx
>>> rax            0x7f7ffffffffe   140187732541438
>>> rdx            0x0      0
>>> (gdb) x/p 0x7f7ffffffffe
>>> 0x7f7ffffffffe: Cannot access memory at address 0x7f7ffffffffe
>>> ----------
>>>
>>> - I think gdb tells us that there is a valid struct pthread_mutex_t in memory.
>>> - I think the faulting instruction reads 8 bytes from 0x10(%rax). In
>>> this case rax points into the last page of the stack segment, so the
>>> 8-byte load falls past the end of the mapping, into unmapped memory:
>>>
>>> 00007f7ffffe0000-00007f7fffffffff     128k 0000000000000000 rw-p- (rwx) 1/0/0 00:00       0 -   [ stack ]
>>>
>>> Any idea about this?
>>>
>>> Regards,
>>>
>>> Stephan
>>

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


