Honza and Angus,
Here are the backtraces (with libqb 0.14.2 and 0.14.3):
# ls -l /var/lib/corosync/
total 8
-rwx------ 1 root root 8 Nov 2 09:02 ringid_10.xxx.xxx.xxx
-rwxr-xr-x 1 root root 8 Nov 1 14:54 ringid_127.0.0.1
# ls -ltr /var/crash
total 47296
-rw-r----- 1 root whoopsie 266218 Oct 30 22:24 _usr_sbin_smbd.0.crash
-rw-r----- 1 root whoopsie 309850 Oct 30 22:25 _usr_libexec_pacemaker_lrmd.0.crash
-rw-r----- 1 root whoopsie 211640 Oct 30 22:25 _usr_sbin_nmbd.0.crash
-rw-r----- 1 root whoopsie 221656 Oct 31 22:43 _usr_sbin_corosync.0.crash
-rw------- 1 root whoopsie 23302144 Nov 1 17:05 core.corosync.0.1351807501.16625
-rw------- 1 root whoopsie 24375296 Nov 2 12:53 core.corosync.0.1351878781.28065
# with libqb 0.14.2
# gdb corosync /var/crash/core.corosync.0.1351807501.16625
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /usr/sbin/corosync...(no debugging symbols found)...done.
[New LWP 16625]
[New LWP 16626]
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `corosync -f'.
Program terminated with signal 7, Bus error.
#0 0x00007f36f6a09d43 in qb_rb_chunks_used () from /usr/lib/libqb.so.0
(gdb) thread apply all bt
Thread 2 (Thread 0x7f36f44a9700 (LWP 16626)):
#0 0x00007f36f67f0fd0 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f36f6a12ff3 in qb_log_init () from /usr/lib/libqb.so.0
#2 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7f36f729b700 (LWP 16625)):
#0 0x00007f36f6a09d43 in qb_rb_chunks_used () from /usr/lib/libqb.so.0
#1 0x00007f36f44ac463 in ?? ()
#2 0x00007f36f6c249f0 in ?? () from /usr/lib/libqb.so.0
#3 0x000000000000002f in ?? ()
#4 0x00007f36f9226e90 in ?? ()
#5 0x00007f36f91b0600 in ?? ()
#6 0x00007f36f6a135b9 in qb_log_thread_stop () from /usr/lib/libqb.so.0
#7 0x0000000000000002 in ?? ()
#8 0x00007f36f6c249f0 in ?? () from /usr/lib/libqb.so.0
#9 0x00007f36f9226e90 in ?? ()
#10 0x00007f36f6c20920 in ?? () from /usr/lib/libqb.so.0
#11 0x00007fff35232ee8 in ?? ()
#12 0x0000000000000000 in ?? ()
# with libqb 0.14.3
# gdb corosync /var/crash/core.corosync.0.1351878781.28065
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /usr/sbin/corosync...(no debugging symbols found)...done.
[New LWP 28065]
[New LWP 28066]
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `corosync -f'.
Program terminated with signal 7, Bus error.
#0 0x00007f5174840bbd in qb_rb_space_free () from /usr/lib/libqb.so.0
(gdb) thread apply all bt
Thread 2 (Thread 0x7f51722e0700 (LWP 28066)):
#0 0x00007f5174627fd0 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f517484a0f3 in ?? () from /usr/lib/libqb.so.0
#2 0x00007f5174621e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f517434f4bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7f51750d2700 (LWP 28065)):
#0 0x00007f5174840bbd in qb_rb_space_free () from /usr/lib/libqb.so.0
#1 0x00007f5174840d90 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
#2 0x00007f517484a6b9 in ?? () from /usr/lib/libqb.so.0
#3 0x00007f51748487ba in qb_log_real_va_ () from /usr/lib/libqb.so.0
#4 0x00007f51751016f0 in ?? ()
#5 0x00007f5174cacdf6 in ?? () from /usr/lib/libtotem_pg.so.5
#6 0x00007f5174ca6a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5
#7 0x00007f5174ca18e2 in ?? () from /usr/lib/libtotem_pg.so.5
#8 0x00007f517484246f in ?? () from /usr/lib/libqb.so.0
#9 0x00007f5174841fe7 in qb_loop_run () from /usr/lib/libqb.so.0
#10 0x00007f51750f0935 in main ()
I see qb_rb_space_free is defined as "The amount of free space in the ring buffer".
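To check my understanding of what that function does, here is a rough sketch
of the usual circular-buffer free-space arithmetic (my own illustration, not
libqb's actual code; the struct and field names are invented):

#include <stdint.h>

/* Invented layout for illustration -- NOT libqb's real structures.
 * A shared-memory ring buffer keeps its read/write positions in a
 * header that both producer and consumer processes map. */
struct rb_hdr {
        uint32_t read_pt;   /* next slot to read */
        uint32_t write_pt;  /* next slot to write */
        uint32_t size;      /* total number of slots */
};

static uint32_t rb_space_free(const struct rb_hdr *h)
{
        /* Classic circular-buffer formula: slots between the writer
         * and the reader, wrapped modulo the buffer size. */
        return (h->read_pt - h->write_pt + h->size - 1) % h->size;
}

If the header's size field were ever read as 0 (say, from a torn or corrupted
shared-memory header), that modulo would be a division by zero, which would at
least match the "trap divide error ... in libqb.so" I saw in dmesg earlier.
And touching an mmap'ed page past the end of a truncated backing file raises
SIGBUS rather than SIGSEGV, which fits the signal we are seeing.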
Thanks,
Andrew
From: "Jan Friesse" <jfriesse@xxxxxxxxxx>
To: pacemaker@xxxxxxxxxxxxxxxxxxx, discuss@xxxxxxxxxxxx
Sent: Monday, November 5, 2012 2:21:09 AM
Subject: Re: [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
Angus Salkeld wrote:
> On 02/11/12 13:07 -0500, Andrew Martin wrote:
>> Hi Angus,
>>
>>
>> Corosync died again while using libqb 0.14.3. Here is the coredump
>> from today:
>> http://sources.xes-inc.com/downloads/corosync.nov2.coredump
>>
>>
>>
>> # corosync -f
>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to
>> provide service.
>> info [MAIN ] Corosync built-in features: pie relro bindnow
>> Bus error (core dumped)
>>
>>
>> Here's the log: http://pastebin.com/bUfiB3T3
>>
>>
>> Did your analysis of the core dump reveal anything?
>>
>
> I can't get any symbols out of these coredumps. Can you try to get a
> backtrace?
>
Andrew,
as I wrote in my original mail, a backtrace can be obtained like this:
coredumps are stored in /var/lib/corosync as core.PID; open one with
gdb corosync /var/lib/corosync/core.PID and then run thread apply all bt.
If you are running a distribution with ABRT support, you can also use
ABRT to generate a report.
It's also quite strange that you are getting SIGBUS. SIGBUS is usually
the result of an unaligned memory access on a processor that does not
support it (for example SPARC). That doesn't seem to apply in your
case, since you are on AMD64.
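For illustration, the classic way to hit SIGBUS on such a processor is an
unaligned access like this (just a sketch; on AMD64 it runs fine, which is
exactly why your SIGBUS is surprising):

#include <stdint.h>

int main(void)
{
        char buf[16];
        /* buf + 1 is almost certainly not 8-byte aligned */
        uint64_t *p = (uint64_t *)(buf + 1);

        *p = 42;  /* SIGBUS on strict-alignment CPUs (e.g. SPARC);
                     harmless on x86-64, which allows unaligned access */
        return (int)*p;
}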
>>
>> Is there a way for me to make it generate fdata with a bus error, or
>> how else can I gather additional information to help debug this?
>>
>
> if you look in exec/main.c and look for SIGSEGV you will see how the
> mechanism for fdata works. Just add a handler for SIGBUS and hook it
> up. Then you should be able to get the fdata for both signals.
>
> I'd rather be able to get a backtrace if possible.
>
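For illustration, hooking SIGBUS into the same fdata path could look roughly
like this (only a sketch; I have not checked the actual handler name in
exec/main.c, so sigsegv_handler below is a placeholder):

#include <signal.h>
#include <string.h>

/* Placeholder for whatever handler exec/main.c already installs for
 * SIGSEGV (the one that dumps the fdata blackbox). */
extern void sigsegv_handler(int num);

static void install_sigbus_handler(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = sigsegv_handler; /* reuse the SIGSEGV/fdata path */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);
}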
Also if possible, please try to compile with --enable-debug (both libqb
and corosync) to get as much information as possible.
> -Angus
>
Regards,
Honza
>>
>> Thanks,
>>
>>
>> Andrew
>>
>> ----- Original Message -----
>>
>> From: "Angus Salkeld" <asalkeld@xxxxxxxxxx>
>> To: pacemaker@xxxxxxxxxxxxxxxxxxx, discuss@xxxxxxxxxxxx
>> Sent: Thursday, November 1, 2012 5:47:16 PM
>> Subject: Re: [Pacemaker] Corosync 2.1.0 dies on both nodes
>> in cluster
>>
>> On 01/11/12 17:27 -0500, Andrew Martin wrote:
>>> Hi Angus,
>>>
>>>
>>> I'll try upgrading to the latest libqb tomorrow and see if I can
>>> reproduce this behavior with it. I was able to get a coredump by
>>> running corosync manually in the foreground (corosync -f):
>>> http://sources.xes-inc.com/downloads/corosync.coredump
>>
>> Thanks, looking...
>>
>>>
>>>
>>> There still isn't anything added to /var/lib/corosync however. What
>>> do I need to do to enable the fdata file to be created?
>>
>> Well, if it crashes with SIGSEGV it will generate the fdata file
>> automatically. (I see you are getting a bus error instead.) :(
>>
>> -A
>>
>>>
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ----- Original Message -----
>>>
>>> From: "Angus Salkeld" <asalkeld@xxxxxxxxxx>
>>> To: pacemaker@xxxxxxxxxxxxxxxxxxx, discuss@xxxxxxxxxxxx
>>> Sent: Thursday, November 1, 2012 5:11:23 PM
>>> Subject: Re: [Pacemaker] Corosync 2.1.0 dies on both nodes
>>> in cluster
>>>
>>> On 01/11/12 14:32 -0500, Andrew Martin wrote:
>>>> Hi Honza,
>>>>
>>>>
>>>> Thanks for the help. I enabled core dumps in
>>>> /etc/security/limits.conf but didn't have a chance to reboot and
>>>> apply the changes, so I don't have a core dump this time. Do core
>>>> dumps need to be enabled for the fdata-DATETIME-PID file to be
>>>> generated? Right now all that is in /var/lib/corosync are the
>>>> ringid_XXX files. Do I need to set something explicitly in the
>>>> corosync config to enable this logging?
>>>>
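>>>> As an aside, a process can also raise its own core-dump limit at
>>>> startup instead of waiting for limits.conf to take effect -- a
>>>> sketch of the idea (my illustration, not corosync's actual code):
>>>>
>>>> #include <sys/resource.h>
>>>>
>>>> static void enable_core_dumps(void)
>>>> {
>>>>         /* allow unlimited-size core files for this process */
>>>>         struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };
>>>>         setrlimit(RLIMIT_CORE, &rl);
>>>> }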
>>>>
>>>> I did find something else interesting with libqb this time. I
>>>> compiled libqb 0.14.2 for use with the cluster. This time when
>>>> corosync died I noticed the following in dmesg:
>>>> Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap
>>>> divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in
>>>> libqb.so.0.14.2[7f657a525000+1f000]
>>>> This error was only present for one of the many times corosync
>>>> has died.
>>>>
>>>>
>>>> I see that there is a newer version of libqb (0.14.3) out, but I
>>>> didn't see a fix for this particular bug. Could this libqb problem
>>>> be what is causing corosync to die? Here's the corresponding
>>>> corosync log file (next time I should have a core dump as well):
>>>> http://pastebin.com/5FLKg7We
>>>
>>> Hi Andrew
>>>
>>> I can't see much wrong with the log either. If you could run with the
>>> latest (libqb-0.14.3) and post a backtrace if it still happens, that
>>> would be great.
>>>
>>> Thanks
>>> Angus
>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Andrew
>>>>
>>>> ----- Original Message -----
>>>>
>>>> From: "Jan Friesse" <jfriesse@xxxxxxxxxx>
>>>> To: "Andrew Martin" <amartin@xxxxxxxxxxx>
>>>> Cc: discuss@xxxxxxxxxxxx, "The Pacemaker cluster resource manager"
>>>> <pacemaker@xxxxxxxxxxxxxxxxxxx>
>>>> Sent: Thursday, November 1, 2012 7:55:52 AM
>>>> Subject: Re: Corosync 2.1.0 dies on both nodes in cluster
>>>>
>>>> Andrew,
>>>> I was not able to find anything interesting (from the corosync point
>>>> of view) in the configuration or logs.
>>>>
>>>> What would be helpful:
>>>> - If corosync died, there should be a
>>>> /var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can
>>>> you please xz them and store them somewhere? (They are quite large
>>>> but compress well.)
>>>> - If you are able to reproduce the problem (which it seems you are),
>>>> can you please enable core dumps and post a backtrace from the
>>>> coredump? (Coredumps are stored in /var/lib/corosync as core.PID;
>>>> open one with gdb corosync /var/lib/corosync/core.PID and then run
>>>> thread apply all bt.) If you are running a distribution with ABRT
>>>> support, you can also use ABRT to generate a report.
>>>>
>>>> Regards,
>>>> Honza
>>>>
>>>> Andrew Martin wrote:
>>>>> Corosync died an additional 3 times during the night on storage1. I
>>>>> wrote a daemon to restart it as soon as it fails, so only one of
>>>>> those times resulted in a STONITH of storage1.
>>>>>
>>>>> I enabled debug in the corosync config, so I was able to capture a
>>>>> period when corosync died with debug output:
>>>>> http://pastebin.com/eAmJSmsQ
>>>>> In this example, Pacemaker finishes shutting down by Nov 01
>>>>> 05:53:02. For reference, here is my Pacemaker configuration:
>>>>> http://pastebin.com/DFL3hNvz
>>>>>
>>>>> It seems that an extra node, 16777343 "localhost", has been added to
>>>>> the cluster after storage1 was STONITHed (it must be the localhost
>>>>> interface on storage1). Is there any way to prevent this?
>>>>>
>>>>> Does this help to determine why corosync is dying, and what I can
>>>>> do to fix it?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew
>>>>>
>>>>> ----- Original Message -----
>>>>>
>>>>> From: "Andrew Martin" <amartin@xxxxxxxxxxx>
>>>>> To: discuss@xxxxxxxxxxxx
>>>>> Sent: Thursday, November 1, 2012 12:11:35 AM
>>>>> Subject: Corosync 2.1.0 dies on both nodes in cluster
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> I recently configured a 3-node fileserver cluster by building
>>>>> Corosync 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes
>>>>> are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and
>>>>> storage1) are "real" nodes where the resources run (a DRBD disk,
>>>>> filesystem mount, and samba/nfs daemons), while the third node
>>>>> (storagequorum) is in standby mode and acts as a quorum node for
>>>>> the cluster. Today I discovered that corosync died on both storage0
>>>>> and storage1 at the same time. Since corosync died, pacemaker shut
>>>>> down as well on both nodes. Because the cluster no longer had
>>>>> quorum (and the no-quorum-policy="freeze"), storagequorum was
>>>>> unable to STONITH either node and just left the resources frozen
>>>>> where they were running, on storage0. I cannot find any log
>>>>> information to determine why corosync crashed, and this is a
>>>>> disturbing problem as the cluster and its messaging layer must be
>>>>> stable. Below is my corosync configuration file as well as the
>>>>> corosync log file from each node during this period.
>>>>>
>>>>> corosync.conf:
>>>>> http://pastebin.com/vWQDVmg8
>>>>> Note that I have two redundant rings. On one of them, I specify the
>>>>> IP address (in this example 10.10.10.7) so that it binds to the
>>>>> correct interface (since potentially in the future those machines
>>>>> may have two interfaces on the same subnet).
>>>>>
>>>>> corosync.log from storage0:
>>>>> http://pastebin.com/HK8KYDDQ
>>>>>
>>>>> corosync.log from storage1:
>>>>> http://pastebin.com/sDWkcPUz
>>>>>
>>>>> corosync.log from storagequorum (the DC during this period):
>>>>> http://pastebin.com/uENQ5fnf
>>>>>
>>>>> Issuing service corosync start && service pacemaker start on
>>>>> storage0 and storage1 resolved the problem and allowed the nodes to
>>>>> successfully reconnect to the cluster. What other information can I
>>>>> provide to help diagnose this problem and prevent it from recurring?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew Martin
>>>>>
_______________________________________________
Pacemaker mailing list: Pacemaker@xxxxxxxxxxxxxxxxxxx
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss