On Tue, 2005-01-18 at 00:48, Patrick Caulfield wrote:
> On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote:
> > My 3 node cluster ran tests for 53 hours before hitting a problem.
> >
> > Node cl031 hit the 1st problem CMAN: killed by STARTTRANS or
> > NOMINATE. There is a DLM assert on cl031 also, but that is
> > after a whole bunch of debug output. The full logs are
> > here (http://developer.osdl.org/daniel/GFS/test.12jan2005/)
> >
> > Any ideas on what is going on?
> >
> > Here is simplified output (in the README file):
> > test started Jan Wed 12 17:18
> > hung after Fri Jan 14 22:00
> >
> > cl031 got an error in just under 53 hours.
> > ==========================================
> > Jan 14 22:00:38 cl031 kernel: CMAN: node cl031a has been removed from the cluster : No response to messages
>
> It's the usual thing. missing messages.
>
> patrick

There is a DLM ASSERT further down in the log that shows error = -105,
which is ENOBUFS. Is this happening after the node has decided to leave
the cluster? I just want to make sure an out-of-memory condition isn't
causing the problem.

Daniel