Re: "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2

On Fri, 5 Jun 2009, Steven Dake wrote:

> On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote:
>> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote:
>>> On Fri, 5 Jun 2009, David Teigland wrote:
>>>> On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote:
>>>>> On Fri, 5 Jun 2009, David Teigland wrote:
>>>>>> They are all complaining that the cluster is down, which is a
>>>>>> polite way of saying that aisexec has
>>>>>> died/crashed/failed/killed/gone-away.
>>>>>
>>>>> Thanks. Why might that have occurred? Where would I look for
>>>>> clues? How can I increase logging output from aisexec?
>>>>
>>>> If you're lucky it'll leave a core file, otherwise aisexec is
>>>> notorious for disappearing without leaving any clues about why.
>>>
>>> That's very disconcerting to hear. Doesn't sound like HA. :-(
>>
>> To clarify, aisexec does not often disappear; it's very reliable.
>> The point was that in the rare case when it does, it's notorious for
>> not leaving any reasons behind.
>>
>> Dave


> 99.9% of the time there would be a core file in /var/lib/openais/core*
> if aisexec faults.

The only file I have there is named:

ringid_10.39.171.212

> We have not seen faults during normal operations for years in a
> released version under typical gfs2 usage scenarios.  If there is no
> core, it means some other component failed, exited, and caused that
> node to be fenced, or the core file could not be written by the OS
> because of some other OS-specific failure.  Another option is that
> the OOM killer killed aisexec.
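
For what it's worth, here is roughly how I'm checking whether this box
could have written a core at all. A minimal sketch in Python; the paths
and the working-directory guess are my assumptions about this machine
(inferred from the /var/lib/openais/core* path above), not anything
from the openais docs:

#!/usr/bin/env python
# Sketch: the usual reasons a daemon leaves no core file behind.
# Paths below are assumptions for this box; adjust as needed.
import os
import resource

# Soft/hard core-size limits for this process. A daemon inherits its
# own limits when it is started; a soft limit of 0 means the kernel
# never writes a core for it.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("RLIMIT_CORE: soft=%s hard=%s (0 disables core dumps)" % (soft, hard))

# Where the kernel would put a core. A pattern starting with '|' pipes
# the dump to a helper program instead of writing core* files.
with open("/proc/sys/kernel/core_pattern") as f:
    print("core_pattern: %s" % f.read().strip())

# Given the /var/lib/openais/core* path mentioned above, the daemon's
# working directory is presumably /var/lib/openais; check it is
# writable.
d = "/var/lib/openais"
print("%s writable: %s" % (d, os.access(d, os.W_OK)))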

No sign of the oom killer in the log I quoted yesterday.
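
(For completeness, this is roughly how I scanned for it. The log path
and the marker strings are assumptions; the exact wording the kernel
prints varies by version:)

#!/usr/bin/env python
# Sketch: scan syslog for oom-killer activity. The log path and the
# needle strings are assumptions; adjust for your distribution.
needles = ("oom-killer", "Out of memory")
with open("/var/log/messages") as log:
    for line in log:
        if any(n in line for n in needles):
            print(line.rstrip())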

> I would have a hard time believing aisexec would crash without a core
> file while the operating system was still functional.
>
> In the trunk we are enhancing our failure analysis to do full-time
> event tracing so failures can be debugged more rapidly than by looking
> at a core file.  I hope that helps.

Thanks.

I'll try to reproduce the scenario. Meanwhile I'm still looking for
hints on how to get more visibility into what is happening.
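
On the logging question I asked earlier, the one knob I know of is the
daemon's own logging stanza in /etc/ais/openais.conf. A sketch of what
I'm planning to try; this is from memory rather than the man page, so
treat the directive names as assumptions and check openais.conf(5) for
your release:

logging {
        # assumed directives; verify against openais.conf(5)
        to_syslog: yes
        syslog_facility: daemon
        to_file: yes
        logfile: /var/log/openais.log
        debug: on
        timestamp: on
}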

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
