On Wed, 2007-09-26 at 10:33 +0200, Borgström Jonas wrote:
> Hi again,
>
> After stress testing a gfs filesystem for 24 hours, fsck.gfs complains about "Found unlinked inode".
> This scared me, so I reran the test and got the same result.
>
> My test consists of two nodes running bonnie++, postgresql and pgbench against a single file system. Every five minutes one of the nodes is shot.
>
> The weird part is that on both occasions the thing fsck.gfs complained about was an "unlinked inode" corresponding to a postgresql pid file. This is a file created (and deleted) every time postgresql is failed over to another node. It is also the last file on the filesystem to be deleted when postgresql was shut down, before the filesystem was unmounted and fsck.gfs was run.
>
> Can anybody explain why this pid file triggered this fsck error twice, and not any of the thousands of files created and deleted by bonnie++?
>
> Does this mean the filesystem is corrupt, or is this expected behavior for files deleted directly before a filesystem is unmounted?
>
> BTW: I'm not able to reproduce this by simply mounting the filesystem, starting/stopping pgsql and unmounting. I need to leave the test running overnight. I've also performed some tests directly on the SAN device, and as far as I can tell it's working as expected.
>
> OS: RHEL5 Advanced Platform

Hi Jonas,

Well, I can think of one possible explanation. I can't be sure because I don't know your test scenario, but this is my theory.

First, a bit of background: when a node gets "shot", as you say, the metadata for some of its recent operations is likely to still be in the journal it was using. Depending on the timing, that metadata may exist only in the journal, if the node was shot before the data was written to its final destination on disk.
Ordinarily that's not a big deal, because the next time the file system is mounted the journal is replayed, which causes the metadata to be written correctly to its proper place on disk, and all is well. That's the same for most journaling file systems, afaik.

A couple of years ago, one of my predecessors (before I started) made an executive decision to make gfs_fsck *clear* the system journals rather than replay them. I don't know offhand if the replay code was once there and got taken out, or if it was never written. At any rate, it seemed like a good idea at the time, and there were several good reasons to justify that decision:

First, if the user is running gfs_fsck, they must already suspect file system corruption. If (and this is a big if) that corruption was caused by recent operations on the file system, then replaying the journal can only compound the corruption, because what is in the journal may itself be based on the corruption. This was more of a concern if, for some reason, GFS had bailed out and "withdrawn" from the file system because it detected corruption, suspecting that it had somehow caused that corruption.

Second, if the user is running gfs_fsck because of corruption, we may not be able to assume that the journal contains good metadata worthy of being replayed.

Third, the user always has the option of replaying the journal before doing gfs_fsck:

1. mount the file system after the crash (to replay the journal)
2. unmount the file system
3. run gfs_fsck

The decision to have gfs_fsck clear the journals was probably made many years ago, before gfs was stable and when these "withdraw" situations were more common. Some people believe that this was a bad decision.
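In case it helps, the manual replay-before-fsck sequence above might look something like this (the device and mount point names here are made up for illustration; substitute your own):

```shell
# Replay the dirty journal by mounting, then fsck the now-clean filesystem.
# /dev/my_vg/gfs_lv and /mnt/gfs are placeholder names for this example.
mount -t gfs /dev/my_vg/gfs_lv /mnt/gfs   # journal recovery happens at mount time
umount /mnt/gfs                           # filesystem is now quiescent on disk
gfs_fsck /dev/my_vg/gfs_lv                # journals are already replayed, so
                                          # nothing is lost when fsck clears them
```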
I believe it makes more sense to trust the journal and replay it before doing the rest of the fsck operations, because in "normal" cases where a node dies (often for some reason unrelated to gfs, like getting shot, fenced, losing power, blowing up a power supply, etc.) you have the potential to lose metadata unless the journal is replayed. Other journaling file systems replay their journals during fsck, or else they inform the user, ask them to take steps to replay the journal (as above), give them the option to clear it, and so on. So far, gfs_fsck does not do that. It just clears the journals.

To remedy the situation, I've got an open bugzilla, 291551 (which may be marked "private" because it was opened internally--sorry), at least for the gfs2_fsck case (gfs_fsck will likely be done too). With that bugzilla, I intend to either ask the user whether they want the journals replayed, replay them automatically, or try to detect problems with them.

I'm not certain that this is the cause of your corruption, but it's the only one I can think of at the moment.

Regards,

Bob Peterson
Red Hat Cluster Suite

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster