Hi,

On Tue, 2008-08-26 at 13:50 -0400, Ross Vandegrift wrote:
> On Mon, Aug 25, 2008 at 07:29:41PM -0400, Ross Vandegrift wrote:
> > Today, the app on one node died. I logged in, assumed things were
> > fenced, and tried to go about my business of restarting it. After
> > some fiddling, I got the box back in the cluster fine.
> >
> > It just happened again, and I've dug in a bit more. I was wrong - the
> > failed node has not been fenced. The last thing in dmesg on the
> > failing node is:
>
> Some more information gleaned today. I left the node running last
> night without fixing the GFS2 access. Today, we noticed that
> filesystem access has been restored for new processes - it's slow
> (sometimes taking minutes to return an ls for 10 items), but
> it eventually responds. The application threads that are sleeping in D
> still haven't received their data from reads issued yesterday
> afternoon.
>
> A cursory examination of DLM-related keys in /sys reveals that the
> working and broken nodes are configured the same. No major disparity
> in terms of memory use, except the obvious fact that the broken node
> shows very little disk IO.
>
> I'm pretty much at a loss - any ideas would be very welcome.
>

There are a few things to check. Firstly, compare /proc/slabinfo on a
slow node with that on a node running at normal speed. That will tell
you whether memory is leaking or not being reclaimed properly. If the
node seems stuck, then try echo t >/proc/sysrq-trigger and look at the
backtraces of any process which has called into gfs2 to see where it
is waiting. A dump of the glocks on all nodes (you'll need to have
debugfs mounted) should then allow you to work out whether something
on the stuck node is waiting for something on one of the other nodes.
Sometimes it's useful to look at the DLM locks as well.
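As a rough illustration of the slabinfo comparison suggested above, here is a minimal Python sketch. It is not part of any cluster tooling: the function names parse_slabinfo and compare_slabinfo and the min_ratio threshold are made up for this example, and it assumes the usual "slabinfo - version: 2.1" column layout (name, active_objs, num_objs, objsize, ...). It takes the text of /proc/slabinfo saved from each node and lists the caches that use noticeably more memory on the slow node.

```python
def parse_slabinfo(text):
    """Map cache name -> approximate bytes (num_objs * objsize).

    Assumes the slabinfo 2.1 layout: name, active_objs, num_objs, objsize, ...
    """
    usage = {}
    for line in text.splitlines():
        # Skip blank lines, the version banner and the column-header comment.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        usage[name] = num_objs * objsize
    return usage


def compare_slabinfo(slow_text, normal_text, min_ratio=2.0):
    """List caches whose usage on the slow node exceeds min_ratio times
    the usage on the normal node, largest difference first.

    min_ratio is an arbitrary illustrative threshold, not a recommended value.
    """
    slow = parse_slabinfo(slow_text)
    normal = parse_slabinfo(normal_text)
    suspects = []
    for name, slow_bytes in slow.items():
        normal_bytes = normal.get(name, 0)
        if slow_bytes > min_ratio * normal_bytes:
            suspects.append((name, slow_bytes, normal_bytes))
    # Sort by how much more memory the slow node is using for each cache.
    suspects.sort(key=lambda row: row[1] - row[2], reverse=True)
    return suspects
```

To use it, save the output of cat /proc/slabinfo on each node to a file, read both files, and pass their contents to compare_slabinfo. A cache that is far larger on the slow node is only a hint about where to look next, not proof of a leak.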
If you feel that you'd rather not go through all the details yourself,
then please put the info into a bugzilla entry and we'll take a look,

Steve.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster