Re: [Linux-cluster] cluster failed after 53 hours

We also have some crashes when writing very large files, 5GB or so,
and it seems the problem occurs when we hit the GFS cache limit; the
machine has 4GB of memory (dual Opteron).

Is there a way to tune the GFS cache to use less memory, say a maximum
of 512MB, so we can debug the problem more easily?

It must be either the remote GFS cache or GNBD, since we can write files of 8GB
or larger when GFS is mounted locally, i.e. when we run the tests on the same
machine that exports the GFS device, via GNBD, to the rest of the nodes.


Marcelo

Patrick Caulfield wrote:

On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote:


My 3 node cluster ran tests for 53 hours before hitting a problem.



Attached is a patch to set the CMAN process to run at realtime priority; I'm not sure whether that's the right thing to do, to be honest.
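For reference, here is a minimal user-space sketch of what putting a process at
realtime priority looks like, assuming the patch uses the standard SCHED_FIFO
mechanism (the priority value is purely illustrative and not taken from the patch):

#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Request SCHED_FIFO for the current process; needs root
         * (or CAP_SYS_NICE).  Priority 50 is just an example value. */
        struct sched_param sp = { .sched_priority = 50 };

        if (sched_setscheduler(getpid(), SCHED_FIFO, &sp) != 0) {
                perror("sched_setscheduler");
                return 1;
        }

        printf("now running SCHED_FIFO at priority %d\n", sp.sched_priority);
        return 0;
}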

Neither am I sure whether your 48-53 hours is significant. It's possible that
memory is an issue (only guessing, but GFS caches locks like crazy); it may be
worth cutting that down a bit by tweaking

/proc/cluster/lock_dlm/drop_count    and/or
/proc/cluster/lock_dlm/drop_period
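As a rough sketch of adjusting one of these tunables (the value written below is
only an example, not a tested recommendation), lowering drop_count from a small
C helper is equivalent to echoing a number into the proc file:

#include <stdio.h>

int main(void)
{
        /* Equivalent to: echo 10000 > /proc/cluster/lock_dlm/drop_count
         * 10000 is an example value only, not a recommended setting. */
        FILE *f = fopen("/proc/cluster/lock_dlm/drop_count", "w");

        if (!f) {
                perror("fopen drop_count");
                return 1;
        }

        fprintf(f, "%d\n", 10000);
        fclose(f);
        return 0;
}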

Otherwise, the only way we're going to get to the bottom of this is to enable
"DEBUG_MEMB" in cman and see what it thinks is going on when the node is kicked
out of the cluster.


patrick




