Hi Angus,
I'll try upgrading to the latest libqb tomorrow and see if I can reproduce this behavior with it. I was able to get a coredump by running corosync manually in the foreground (corosync -f):
There still isn't anything added to /var/lib/corosync however. What do I need to do to enable the fdata file to be created?
Thanks,
Andrew
From: "Angus Salkeld" <asalkeld@xxxxxxxxxx>
To: pacemaker@xxxxxxxxxxxxxxxxxxx, discuss@xxxxxxxxxxxx
Sent: Thursday, November 1, 2012 5:11:23 PM
Subject: Re: [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
On 01/11/12 14:32 -0500, Andrew Martin wrote:
>Hi Honza,
>
>
>Thanks for the help. I enabled core dumps in /etc/security/limits.conf but didn't have a chance to reboot and apply the changes, so I don't have a core dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID file to be generated? Right now all that is in /var/lib/corosync are the ringid_XXX files. Do I need to set something explicitly in the corosync config to enable this logging?
>
>
>I did find something else interesting with libqb this time. I compiled libqb 0.14.2 for use with the cluster. This time when corosync died I noticed the following in dmesg:
>Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in libqb.so.0.14.2[7f657a525000+1f000]
>This error appeared on only one of the many occasions when corosync has died.
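The dmesg line above carries enough to locate the faulting instruction inside libqb. A rough sketch of the arithmetic, assuming the bracketed range is the library's text mapping starting at file offset 0 (the variable names are only for illustration):

```shell
# Values copied from the dmesg line quoted above.
ip=0x7f657a52e517      # faulting instruction pointer
base=0x7f657a525000    # start of the libqb.so.0.14.2 mapping
size=0x1f000           # length of that mapping

# Offset of the faulting instruction relative to the start of the mapping.
offset=$(printf '0x%x' $(( $ip - $base )))

# Sanity check: the offset must fall inside the mapping.
if [ $(( $offset < $size )) -eq 1 ]; then
    echo "offset into libqb: $offset"
else
    echo "offset $offset is outside the mapping" >&2
fi
```

If the library was built with debug symbols, that offset can be passed to addr2line (addr2line -f -e <path to libqb.so.0.14.2> <offset>) to name the function; the path depends on where libqb was installed.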
>
>
>I see that there is a newer version of libqb (0.14.3) out, but didn't see a fix for this particular bug. Could this libqb problem be what is causing corosync to die? Here's the corresponding corosync log file (next time I should have a core dump as well):
>http://pastebin.com/5FLKg7We
Hi Andrew
I can't see much wrong with the log either. If you could run with the latest
(libqb-0.14.3) and post a backtrace if it still happens, that would be great.
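Something along these lines should produce the backtrace once a core file exists. This is only a sketch: the function name corosync_backtrace is made up for illustration, and it assumes gdb is installed and corosync is on the PATH.

```shell
# Hypothetical helper: print a backtrace of all threads from a corosync core file.
# Core dumps must be enabled in the shell that starts corosync, for example:
#   ulimit -c unlimited
#   corosync -f
corosync_backtrace() {
    core_file="$1"
    if [ ! -r "$core_file" ]; then
        echo "core file not readable: $core_file" >&2
        return 1
    fi
    # -batch makes gdb exit after running the -ex command.
    gdb -batch -ex 'thread apply all bt' "$(command -v corosync)" "$core_file"
}
```

Usage would be corosync_backtrace /var/lib/corosync/core.PID > backtrace.txt, with PID replaced by the process ID in the core file's name.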
Thanks
Angus
>
>
>Thanks,
>
>
>Andrew
>
>----- Original Message -----
>
>From: "Jan Friesse" <jfriesse@xxxxxxxxxx>
>To: "Andrew Martin" <amartin@xxxxxxxxxxx>
>Cc: discuss@xxxxxxxxxxxx, "The Pacemaker cluster resource manager" <pacemaker@xxxxxxxxxxxxxxxxxxx>
>Sent: Thursday, November 1, 2012 7:55:52 AM
>Subject: Re: Corosync 2.1.0 dies on both nodes in cluster
>
>Andrew,
>I was not able to find anything interesting (from corosync point of
>view) in configuration/logs (corosync related).
>
>What would be helpful:
>- If corosync died, there should be a
>/var/lib/corosync/fdata-DATETIME-PID file for the dead corosync. Can you please
>xz them and store them somewhere (they are quite large but compress well)?
>- If you are able to reproduce the problem (which it seems you are), can
>you please enable core dumps and store a backtrace of the core dump
>somewhere? (Core dumps are stored in /var/lib/corosync as core.PID; the
>way to obtain a backtrace is gdb corosync /var/lib/corosync/core.PID, and,
>inside gdb, thread apply all bt.) If you are running a distribution with ABRT
>support, you can also use ABRT to generate a report.
>
>Regards,
>Honza
>
>Andrew Martin napsal(a):
>> Corosync died an additional 3 times during the night on storage1. I wrote a daemon that attempts to restart it as soon as it fails, so only one of those times resulted in a STONITH of storage1.
>>
>> I enabled debug in the corosync config, so I was able to capture a period when corosync died with debug output:
>> http://pastebin.com/eAmJSmsQ
>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For reference, here is my Pacemaker configuration:
>> http://pastebin.com/DFL3hNvz
>>
>> It seems that an extra node, 16777343 "localhost", has been added to the cluster after storage1 was STONITHed (it must be the localhost interface on storage1). Is there any way to prevent this?
>>
>> Does this help to determine why corosync is dying, and what I can do to fix it?
>>
>> Thanks,
>>
>> Andrew
>>
>> ----- Original Message -----
>>
>> From: "Andrew Martin" <amartin@xxxxxxxxxxx>
>> To: discuss@xxxxxxxxxxxx
>> Sent: Thursday, November 1, 2012 12:11:35 AM
>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>
>>
>> Hello,
>>
>> I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are "real" nodes where the resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while the third node (storagequorum) is in standby mode and acts as a quorum node for the cluster. Today I discovered that corosync died on both storage0 and storage1 at the same time. Since corosync died, pacemaker shut down as well on both nodes. Because the cluster no longer had quorum (and the no-quorum-policy="freeze"), storagequorum was unable to STONITH either node and just left the resources frozen where they were running, on storage0. I cannot find any log information to determine why corosync crashed, and this is a disturbing problem as the cluster and its messaging layer must be stable. Below is my corosync configuration file as well as the corosync log file from each node during this period.
>>
>> corosync.conf:
>> http://pastebin.com/vWQDVmg8
>> Note that I have two redundant rings. On one of them, I specify the IP address (in this example 10.10.10.7) so that it binds to the correct interface (since potentially in the future those machines may have two interfaces on the same subnet).
>>
>> corosync.log from storage0:
>> http://pastebin.com/HK8KYDDQ
>>
>> corosync.log from storage1:
>> http://pastebin.com/sDWkcPUz
>>
>> corosync.log from storagequorum (the DC during this period):
>> http://pastebin.com/uENQ5fnf
>>
>> Issuing service corosync start && service pacemaker start on storage0 and storage1 resolved the problem and allowed the nodes to successfully reconnect to the cluster. What other information can I provide to help diagnose this problem and prevent it from recurring?
>>
>> Thanks,
>>
>> Andrew Martin
>>
>> _______________________________________________
>> discuss mailing list
>> discuss@xxxxxxxxxxxx
>> http://lists.corosync.org/mailman/listinfo/discuss
>>
>>
>>
>>
>>
>> _______________________________________________
>> discuss mailing list
>> discuss@xxxxxxxxxxxx
>> http://lists.corosync.org/mailman/listinfo/discuss
>
>
>_______________________________________________
>Pacemaker mailing list: Pacemaker@xxxxxxxxxxxxxxxxxxx
>http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
>Project Home: http://www.clusterlabs.org
>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>Bugs: http://bugs.clusterlabs.org
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss