On Mon, Aug 25, 2008 at 07:29:41PM -0400, Ross Vandegrift wrote:
> Today, the app on one node died. I logged in, assumed things were
> fenced, and tried to go about my business of restarting it. After
> some fiddling, I got the box back in the cluster fine.
>
> It just happened again, and I've dug in a bit more. I was wrong - the
> failed node has not been fenced. The last thing in dmesg on the
> failing node is:

Some more information gleaned today. I left the node running last
night without fixing the GFS2 access. Today, we noticed that
filesystem access has been restored for new processes - it's slow
(sometimes taking minutes to return an ls of 10 items), but it
eventually responds. The application threads that are sleeping in D
state still haven't received their data from reads issued yesterday
afternoon.

A cursory examination of the DLM-related keys in /sys reveals that the
working and broken nodes are configured the same. There is no major
disparity in memory use; the only obvious difference is that the
broken node shows very little disk I/O.

I'm pretty much at a loss - any ideas would be very welcome.

--
Ross Vandegrift
ross@xxxxxxxxxxx

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the
mathematicians have made a covenant with the devil to darken the
spirit and to confine man in the bonds of Hell."
        --St. Augustine, De Genesi ad Litteram, Book II, xviii, 37

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster