----- "Gavin Conway" <gavin.conway@xxxxxxxxxxxxxxxxx> wrote: | We'll give this a go and see what it does. We did manage to track down | the latest issue to a bad script that the customer had written which | caused one of the nodes to exhaust all of its available memory. That | then caused a knock-on effect to the lock_dlm process which was unable | to drop it's file locks, which then rolled the affect on to the rest | of the cluster as they started being unable to open files. Hi Gavin, You could also try my hang analyzer to see if it finds anything: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/gfs2_hangalyzer.c Compile with: gcc -o gfs2_hangalyzer gfs2_hangalyzer.c Run with: ./gfs2_hangalyzer -n <any node in the cluster> This leaves a bunch of files in /tmp/ so you may want to clean them up. But be forewarned that you should have rsa keys set up ahead of time so you can ssh to all the nodes in your cluster without a password before running this tool. Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster