I have a three-node cluster running the latest RHEL4U4. It had been running seamlessly until this evening, when one of the GFS file systems had a problem and failed completely on one node, jrmedia-a. Executing any command resulted in an I/O error. The resource manager noticed the failure but could not relocate the corresponding service to another node.

Can anyone shed some light on what happened?

I have unmounted and then remounted the file system on that node and stopped and started the service, and everything seems to have returned to normal. Here are the relevant log messages from all three nodes.

jrmedia-a

Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: fatal: filesystem consistency error
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: RG = 18652911
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: function = gfs_setbit
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: file = /builddir/build/BUILD/gfs-kernel-2.6.9-60/smp/src/gfs/bits.c, line = 71
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: time = 1164066728
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: about to withdraw from the cluster
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: waiting for outstanding I/O
Nov 20 23:52:08 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: telling LM to withdraw
Nov 20 23:52:11 jrmedia-a kernel: lock_dlm: withdraw abandoned memory
Nov 20 23:52:11 jrmedia-a kernel: GFS: fsid=alpha_cluster:customers.0: withdrawn
Nov 20 23:52:47 jrmedia-a clurgmgrd[4938]: <notice> status on clusterfs "customersfs" returned 1 (generic error)
Nov 20 23:52:47 jrmedia-a clurgmgrd[4938]: <notice> Stopping service customers
Nov 20 23:52:48 jrmedia-a clurgmgrd: [4938]: <info> Removing IPv4 address 10.0.20.56 from eth1
Nov 20 23:52:58 jrmedia-a clurgmgrd: [4938]: <err> /mnt/customers is not a directory
Nov 20 23:52:58 jrmedia-a clurgmgrd[4938]: <notice> stop on nfsclient "read-write" returned 2 (invalid argument(s))
Nov 20 23:52:59 jrmedia-a clurgmgrd[4938]: <crit> #12: RG customers failed to stop; intervention required
Nov 20 23:52:59 jrmedia-a clurgmgrd[4938]: <notice> Service customers is failed

jrmedia-b

Nov 20 23:52:09 jrmedia-b kernel: GFS: fsid=alpha_cluster:customers.1: jid=0: Trying to acquire journal lock...
Nov 20 23:52:09 jrmedia-b kernel: GFS: fsid=alpha_cluster:customers.1: jid=0: Busy
Nov 20 23:53:58 jrmedia-b kernel: dlm: customers: process_lockqueue_reply id 7a601e3 state 0

jrmedia-c

Nov 20 23:52:09 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Trying to acquire journal lock...
Nov 20 23:52:09 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Looking at journal...
Nov 20 23:52:09 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Acquiring the transaction lock...
Nov 20 23:52:10 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Replaying journal...
Nov 20 23:52:10 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Replayed 0 of 8 blocks
Nov 20 23:52:10 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: replays = 0, skips = 3, sames = 5
Nov 20 23:52:10 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Journal replayed in 1s
Nov 20 23:52:10 jrmedia-c kernel: GFS: fsid=alpha_cluster:customers.2: jid=0: Done

Ben

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster