Anthony wrote:
Hello,
Yesterday we had a full GFS system failure: all partitions became
inaccessible from all 32 nodes, and now the whole cluster is
inaccessible. Has anyone seen this problem before?
GFS: Trying to join cluster "lock_gulm", "gen:ir"
GFS: fsid=gen:ir.32: Joined cluster. Now mounting FS...
GFS: fsid=gen:ir.32: jid=32: Trying to acquire journal lock...
GFS: fsid=gen:ir.32: jid=32: Looking at journal...
GFS: fsid=gen:ir.32: jid=32: Done
NETDEV WATCHDOG: jnet0: transmit timed out
ipmi_kcs_sm: kcs hosed: Not in read state for error2
NETDEV WATCHDOG: jnet0: transmit timed out
ipmi_kcs_sm: kcs hosed: Not in read state for error2
GFS: fsid=gen:ir.32: fatal: filesystem consistency error
GFS: fsid=gen:ir.32: function = trans_go_xmote_bh
GFS: fsid=gen:ir.32: file = /usr/src/build/626614-x86_64/BUILD/gfs-kernel-2.6.9-42/smp/src/gfs/glops.c, line = 542
GFS: fsid=gen:ir.32: time = 1150223491
GFS: fsid=gen:ir.32: about to withdraw from the cluster
GFS: fsid=gen:ir.32: waiting for outstanding I/O
GFS: fsid=gen:ir.32: telling LM to withdraw
Hi Anthony,
This problem could be caused by a couple of things. Basically, it
indicates that a filesystem consistency error occurred. In this
particular case, it means a write was done to the file system under a
transaction lock, but after the write transaction, the journal for the
written data was found to be still in use. That means one of two
things: either (1) some process was writing to the GFS journal when it
shouldn't have been (i.e. without the necessary lock), or (2) the
journal data that was written was somehow corrupted on disk.
In the past, we've often tracked such problems down to hardware
failures; in other words, even without the GFS file system in the loop,
if you use a command like 'dd' to send data to the raw hard disk device
and then use dd to read it back, the data comes back from the hardware
different from what was written out. That particular scenario is
documented as bugzilla bug 175589.
I'm not saying that is your problem, just that it's what we've seen in
the past.
My recommendation is to read the bugzilla, back up your entire file
system (or copy it to a different set of drives), and then run some
hardware tests as described in the bugzilla to see whether your
hardware can consistently write data, read it back, and get a match
between what was written and what was read back. Do this test without
GFS in the picture at all, and ideally with only one node accessing
that storage at a time.
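For reference, here is a rough sketch of that kind of raw-device
write/read/compare test. The device name (/dev/sdb) and the sizes are
only placeholders for your setup, and note that it overwrites whatever
is at that spot on the device, so only point it at storage you have
already backed up or can afford to lose:

  # DESTRUCTIVE: writes directly to the raw device.
  # /dev/sdb is a placeholder; substitute your shared-storage device.
  DEV=/dev/sdb

  # Generate a known pattern, write it to the raw device, read it back.
  dd if=/dev/urandom of=/tmp/pattern.bin bs=1M count=64
  dd if=/tmp/pattern.bin of="$DEV" bs=1M count=64 oflag=direct
  dd if="$DEV" of=/tmp/readback.bin bs=1M count=64 iflag=direct

  # Compare what was written with what came back; any difference points
  # at the storage hardware or path rather than at GFS.
  cmp /tmp/pattern.bin /tmp/readback.bin && echo "match" || echo "MISMATCH"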
You will probably also want to run gfs_fsck before mounting again to
check the consistency
of the file system, just in case some rogue process on one of the nodes
was doing something
destructive.
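If I remember the options right, something like this would do it (the
device path below is just a placeholder for whichever block device
holds that file system, and it must be unmounted on every node first):

  # Run only with the file system unmounted on ALL nodes.
  # /dev/vg_gfs/lv_ir is a placeholder for your GFS block device.

  # Report-only pass (answers "no" to every repair prompt):
  gfs_fsck -n /dev/vg_gfs/lv_ir

  # Once you have a good backup, let it actually make repairs:
  gfs_fsck -y /dev/vg_gfs/lv_ir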
WARNING: overwriting your GFS file system with the dd test will of
course damage what was there, so be careful not to destroy your data;
make a copy before doing this.
If the hardware checks out 100% and you can recreate the failure, open
a bugzilla against GFS and we'll go from there. In other words, beyond
hardware problems, we don't know of anything in GFS that can cause
this.
I hope this helps.
Regards,
Bob Peterson
Red Hat Cluster Suite
--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster