On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote:
> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote:
> >
> > On Fri, 5 Jun 2009, David Teigland wrote:
> >
> > >On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote:
> > >>
> > >>On Fri, 5 Jun 2009, David Teigland wrote:
> > >>
> > >>>On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote:
> > >>>>Jun 4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting
> > >>>>Jun 4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting
> > >>>>Jun 4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting
> > >>>>Jun 4 10:55:35 sun4150node1 qdiskd[8128]: <err> cman_dispatch: Host is down
> > >>>
> > >>>They are all complaining that the cluster is down, which is a polite
> > >>>way of saying that aisexec has died/crashed/failed/killed/gone-away.
> > >>
> > >>Thanks. Why might that have occurred? Where would I look for clues? How
> > >>can I increase logging output from aisexec?
> > >
> > >If you're lucky it'll leave a core file, otherwise aisexec is notorious for
> > >disappearing without leaving any clues about why.
> >
> > That's very disconcerting to hear. Doesn't sound like HA. :-(
>
> To clarify, aisexec does not often disappear, it's very reliable. The point
> was that in the rare case when it does, it's notorious for not leaving any
> reasons behind.
>
> Dave
>

99.9% of the time there would be a core file in /var/lib/openais/core* if aisexec faults. We have not seen faults during normal operations for years in a released version under typical gfs2 usage scenarios. If there is no core, it means some other component failed, exited, and caused that node to be fenced, or the core file could not be written by the OS because of some other OS-specific failure. Another option is that the OOM killer killed aisexec.
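Those two checks (a core file under /var/lib/openais/ and an OOM kill in the system log) can be sketched roughly as follows. This is only an illustration, not part of openais: the log path /var/log/messages and the "Out of memory" wording of the kernel message are assumptions that vary by distribution and kernel version, and the function names are made up for the example.

```python
import glob
import re

# Core-file pattern as given in the message above.
CORE_PATTERN = "/var/lib/openais/core*"
# Assumption: syslog location; differs between distributions.
SYSLOG_PATH = "/var/log/messages"

# Assumption: the kernel reports OOM kills with an "Out of memory" line.
# The exact wording differs between kernel versions, so match loosely.
OOM_RE = re.compile(r"out of memory", re.IGNORECASE)


def find_core_files(pattern=CORE_PATTERN):
    """Return a sorted list of paths matching the core-file pattern."""
    return sorted(glob.glob(pattern))


def find_oom_lines(lines, process="aisexec"):
    """Return log lines that contain an OOM marker and name the process.

    A hit is a lead to follow up on, not proof that the process was killed.
    """
    return [ln.rstrip("\n") for ln in lines
            if OOM_RE.search(ln) and process in ln]


def main():
    cores = find_core_files()
    print("core files:", cores or "none found")
    try:
        with open(SYSLOG_PATH, errors="replace") as fh:
            hits = find_oom_lines(fh)
    except OSError as exc:
        # Missing file or insufficient permissions (reading syslog may need root).
        print("could not read", SYSLOG_PATH, "-", exc)
        return
    print("OOM lines mentioning aisexec:", hits or "none found")


if __name__ == "__main__":
    main()
```

If neither check turns anything up, the remaining explanations above (another component failing and the node being fenced, or the OS being unable to write the core) are the ones to look into.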
I would have a hard time believing aisexec would crash without a core file while the operating system was still functional.

In the trunk we are enhancing our failure analysis to do full-time event tracing, so failures can be debugged more rapidly than by looking at a core file.

I hope that helps.

regards
-steve

> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster