On Wed, Feb 23, 2005 at 03:12:22PM -0800, Daniel McNeil wrote:
> On Tue, 2005-02-22 at 00:35, Patrick Caulfield wrote:
> > On Mon, Feb 21, 2005 at 05:34:23PM -0800, Daniel McNeil wrote:
> > > My latest test ran 49 hours before a node got kicked out.
>
> My test is doing a bunch of mount/umount's, so that is causing
> the service join/leave for the DLM lock space and file system
> mount group.
>
> Are you saying leaving a DLM lock space causes a DLM recovery and
> that is what is leading to the 'No response to messages'?

Indirectly, we think.

> How does DLM hogging a cpu lead to 'no response'?
> BTW, this is running on 2 proc machines, so hogging one cpu
> still leaves one available.

Hmm, that makes it more, er, interesting.

> Who is not responding to what message?

It's cman that has sent a message to another node, to process the join.
It hasn't got back an ACK for that message after max_retries (default 3)
attempts.

-- 
patrick
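
For anyone following along, the send-and-wait-for-ACK behaviour described above can be sketched roughly as below. This is only an illustration, not cman's actual code: the function name and the send/wait_for_ack callbacks are hypothetical, and it assumes max_retries counts the total number of sends.

```python
from typing import Callable

MAX_RETRIES = 3  # default quoted in the message above; constant name is hypothetical


def send_with_ack(send: Callable[[], None],
                  wait_for_ack: Callable[[], bool],
                  max_retries: int = MAX_RETRIES) -> bool:
    """Send a message up to max_retries times; return True once an ACK arrives.

    Returning False corresponds to the 'no response' case: the sender gives
    up on the peer after exhausting its attempts.
    """
    for _attempt in range(max_retries):
        send()
        if wait_for_ack():
            return True
    return False


# Simulate a peer that never acknowledges the join message.
sent = []
acked = send_with_ack(lambda: sent.append("JOIN"), lambda: False)
print(acked, len(sent))  # prints: False 3
```

The point is that the sender only sees a missing ACK. A peer that is too busy to reply in time looks the same as one that is down.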