Re: Corosync/Pacemaker on NetBSD

Hi,

it might be a NetBSD-related bug, or something is interfering with its
pthread implementation. I found a 2.x release that did not compile on
NetBSD some time ago, but you are right - the 2.1.0 release can be
built successfully.

There is another issue - this happens when I start it:

Dec 10 11:21:16 [21835] ctx4980gate2 corosync notice  [TOTEM ]
Initializing transport (UDP/IP Multicast).
Dec 10 11:21:16 [21835] ctx4980gate2 corosync notice  [TOTEM ]
Initializing transmit/receive security (NSS) crypto: none hash: none
Dec 10 11:21:16 [21835] ctx4980gate2 corosync error   [QB    ]
kevent(poll): Bad file descriptor (9)
Dec 10 11:21:16 [21835] ctx4980gate2 corosync warning [QB    ]
fd->poll: Bad file descriptor (9)
[the kevent(poll) error / fd->poll warning pair above repeats continuously]
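For what it's worth, this is the symptom you get when a descriptor is closed but never removed from the event loop's set. The sketch below is my own minimal reproduction, not libqb code (libqb uses kevent() on NetBSD; portable poll() shows the same failure mode, reporting the stale entry instead of blocking):

```c
/* Sketch (not libqb code): a descriptor closed but left "registered"
 * in the poll set is flagged on every call instead of blocking --
 * matching the repeating "Bad file descriptor" messages above. */
#include <poll.h>
#include <unistd.h>

int stale_fd_flagged(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);                     /* closed, but still in our poll set */
    close(fds[1]);

    struct pollfd p = { .fd = fds[0], .events = POLLIN };
    int n = poll(&p, 1, 100);

    /* poll() reports the stale entry immediately via POLLNVAL; a loop
     * that never deregisters the fd will spin on this forever. */
    return n == 1 && (p.revents & POLLNVAL) != 0;
}
```

If that is what is happening here, the interesting question is which descriptor gets closed behind the loop's back during transport initialization.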

Ideas?
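On the earlier SIGSEGV (quoted below): the faulting instruction reads through %fs, i.e. the thread pointer, and fs is 0 - so the crashing thread apparently has no TLS block set up at all. As a sanity check (my own sketch, nothing corosync-specific), a normally created pthread always has working fs-relative TLS:

```c
/* Sketch (not corosync code): pthread_mutex_lock's fast path reads
 * through the thread pointer (%fs base on x86-64).  fs = 0 in the
 * crash means the thread has no TLS; a thread created the normal way
 * always does, as this check shows. */
#include <pthread.h>

static __thread long tls_val;          /* stored %fs-relative on x86-64 */

static void *worker(void *arg)
{
    tls_val = (long)arg;               /* faults if the thread pointer is unset */
    return (void *)tls_val;
}

int tls_ok(void)
{
    pthread_t t1, t2;
    void *r1, *r2;

    tls_val = 1;                       /* main thread's own copy */
    pthread_create(&t1, NULL, worker, (void *)2);
    pthread_create(&t2, NULL, worker, (void *)3);
    pthread_join(t1, &r1);
    pthread_join(t2, &r2);

    /* each thread saw its private copy; main's copy is untouched */
    return tls_val == 1 && (long)r1 == 2 && (long)r2 == 3;
}
```

So whatever creates the thread that later crashes must be bypassing or corrupting that setup - possibly related to the plugin being dlopen()ed after the process started.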

Regards,

Stephan


2012/12/10 Jan Friesse <jfriesse@xxxxxxxxxx>:
> Stephan,
> do you think it could be a problem in NetBSD's thread code itself?
> Because if so, I cannot do much about it (other than advising you to
> try corosync 2.1.x + pacemaker 1.1; that combination is no longer
> based on plugins (it uses cpg directly), so it should not fail, and
> 2.1 was tested on NetBSD to at least compile and do basic work). If
> you believe this is a problem in corosync, can you please run some
> kind of tool (I don't know whether valgrind is available) to give me
> a hint about what is happening (e.g. a memory overwrite)?
>
> Regards,
>   Honza
>
> Stephan wrote:
>> Hi Jan,
>>
>> this happens both when using "corosync-cfgtool -l" and with a file in
>> service.d. It seems that something hoses the thread's internal data
>> (TLS). According to gdb, the pointer (&conn_info->addr) passed to
>> pthread_mutex_lock() (via %rdi) is correct. I added a syslog()
>> statement before the call to pthread_mutex_lock() and found the
>> program crashing inside syslog() itself. That happens because of
>> libc's internal synchronization for threaded programs, which also
>> calls pthread_mutex_lock().
>>
>> The crash happens here:
>>
>> (gdb) frame 0
>> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>> (gdb) x/5i pthread_mutex_lock
>>    0x7f7ff68078e0 <pthread_mutex_lock>: mov    %fs:0x0,%rax
>> => 0x7f7ff68078e9 <pthread_mutex_lock+9>:       mov    0x10(%rax),%rdx
>>    0x7f7ff68078ed <pthread_mutex_lock+13>:      xor    %eax,%eax
>>    0x7f7ff68078ef <pthread_mutex_lock+15>:      lock cmpxchg %rdx,0x10(%rdi)
>>    0x7f7ff68078f5 <pthread_mutex_lock+21>:      test   %rax,%rax
>> (gdb) info reg fs rax rdi
>> fs             0x0      0
>> rax            0x7f7ffffffffe   140187732541438
>> rdi            0x7f7ff738f050   140187585278032
>> (gdb) frame 1
>> #1  0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff738f000) at
>> coroipcs.c:465
>> 465             pthread_mutex_lock (&conn_info->mutex);
>> (gdb) p &conn_info->mutex
>> $2 = (pthread_mutex_t *) 0x7f7ff738f050
>>
>>
>>
>> Probably not easy to fix...
>>
>> Regards,
>>
>> Stephan
>>
>> 2012/12/10 Jan Friesse <jfriesse@xxxxxxxxxx>:
>>> Stephan,
>>> is this happening only with pacemaker, or is it a general problem
>>> (with dynamic loading of plugins)? Can you test loading a different
>>> plugin at runtime (e.g. one of the openais ones), or try configuring
>>> pacemaker to load after start:
>>>
>>> service {
>>> name: pacemaker
>>> ver: 0
>>> }
>>>
>>> Regards,
>>>   Honza
>>>
>>> Stephan napsal(a):
>>>> Hi all,
>>>>
>>>> now that Corosync 1.x (1.4.4 in this case) works on NetBSD (6.0 amd64)
>>>> "out of the box", I compiled Pacemaker 1.0 and 1.1 and tried to run it
>>>> on top of corosync. Unfortunately, when I load Pacemaker using
>>>> "corosync-cfgtool -l pacemaker", corosync crashes with SIGSEGV.
>>>>
>>>> I already found this with gdb:
>>>>
>>>> -----8<--------
>>>> Core was generated by `corosync'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>>>> (gdb) bt full
>>>> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>>>> No symbol table info available.
>>>> #1  0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff5308000) at
>>>> coroipcs.c:465
>>>>         conn_info = 0x7f7ff5308000
>>>>         retval = 0
>>>> #2  pthread_ipc_consumer (conn=0x7f7ff5308000) at coroipcs.c:674
>>>>         conn_info = 0x7f7ff5308000
>>>>         header = <optimized out>
>>>>         coroipc_response_header = {size = 660260756, id = 5, error = 0}
>>>>         send_ok = <optimized out>
>>>>         new_message = <optimized out>
>>>>         sem_value = 0
>>>> #3  0x00007f7ff6809d75 in ?? () from /usr/lib/libpthread.so.1
>>>> No symbol table info available.
>>>> #4  0x00007f7ff60759f0 in ___lwp_park50 () from /usr/lib/libc.so.12
>>>> No symbol table info available.
>>>> Cannot access memory at address 0x7f7ff0000000
>>>> (gdb) frame 1
>>>> #1  0x00007f7ff7002e14 in ipc_thread_active (conn=0x7f7ff5308000) at
>>>> coroipcs.c:465
>>>> 465             pthread_mutex_lock (&conn_info->mutex);
>>>> (gdb) print &conn_info->mutex
>>>> $1 = (pthread_mutex_t *) 0x7f7ff5308050
>>>> (gdb) p *$
>>>> $2 = {ptm_magic = 858980355, ptm_errorcheck = 0 '\000', ptm_pad1 =
>>>> "\000\000", ptm_interlock = 0 '\000', ptm_pad2 = "\000\000", ptm_owner
>>>> = 0x0, ptm_waiters = 0x0, ptm_recursed = 0, ptm_spare2 = 0x0}
>>>> (gdb) frame 0
>>>> #0  0x00007f7ff68078e9 in pthread_mutex_lock () from /usr/lib/libpthread.so.1
>>>> (gdb) x/2i 0x00007f7ff68078e0
>>>>    0x7f7ff68078e0 <pthread_mutex_lock>: mov    %fs:0x0,%rax
>>>> => 0x7f7ff68078e9 <pthread_mutex_lock+9>:       mov    0x10(%rax),%rdx
>>>> (gdb) info reg rax rdx
>>>> rax            0x7f7ffffffffe   140187732541438
>>>> rdx            0x0      0
>>>> (gdb) x/p 0x7f7ffffffffe
>>>> 0x7f7ffffffffe: Cannot access memory at address 0x7f7ffffffffe
>>>> ----------
>>>>
>>>> - I think gdb tells us that there is a valid struct pthread_mutex_t in memory.
>>>> - I think that 8 bytes are read from the address rax points to (plus
>>>> the 0x10 offset). In this case rax points into the last page of the
>>>> stack segment, and the access crosses the border into the next page,
>>>> which is not mapped:
>>>>
>>>> 00007f7ffffe0000-00007f7fffffffff 128k 0000000000000000 rw-p (rwx) 1/0/0 00:00 0 - [ stack ]
>>>>
>>>> Any idea about this?
>>>>
>>>> Regards,
>>>>
>>>> Stephan
>>>
>

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


