Hi again,

I was just able to reproduce the filesystem corruption again. This time
four lost zero-sized inodes were found :( And unfortunately
mounting+unmounting the filesystem didn't make the lost inodes go away.
I still have a copy of the corrupted filesystem if there are any more
things you want me to test.

Here's the gfs_fsck output:

[root@test-db2 ~]# gfs_fsck -n /dev/testdb/pg_fs
Initializing fsck
Starting pass1
Pass1 complete
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Pass2 complete
Starting pass3
Pass3 complete
Starting pass4
Found unlinked inode at 1706623
Unlinked inode has zero size
Unlinked inode left unlinked
Found unlinked inode at 1706620
Unlinked inode left unlinked
Found unlinked inode at 1706622
Unlinked inode has zero size
Unlinked inode left unlinked
Found unlinked inode at 1706621
Unlinked inode has zero size
Unlinked inode left unlinked
Pass4 complete
Starting pass5
Converting 8457 unused metadata blocks to free data blocks...
Converting 61490 unused metadata blocks to free data blocks...
...
...

Regards,
Jonas

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: 26 September 2007 16:50
To: rpeterso@xxxxxxxxxx; linux clustering
Subject: RE: Found unlinked inode

From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Bob Peterson
Sent: 26 September 2007 16:01
To: linux clustering
Subject: Re: Found unlinked inode

> Hi Jonas,
>
> Well, I can think of one possible explanation. I can't be sure because
> I don't know your test scenario, but this is my theory. First, a bit
> of background:
>
> When a node gets "shot", as you say, the metadata for some of its
> recent operations is likely to still be in the journal it was using.
> Depending on when it gets shot, that metadata may exist only in the
> journal, if the node was shot before the data was written to its final
> destination on disk.
>
> Ordinarily, that's not a big deal, because the next time the file
> system is mounted, the journal is replayed, which causes the metadata
> to be written correctly to its proper place on disk, and all is well.
> That's the same for most journaling file systems, afaik.
>
> A couple of years ago, one of my predecessors (before I started) made
> an executive decision to make gfs_fsck *clear* the system journals
> rather than replay them. I don't know offhand if the replay code was
> once there and got taken out, or if it was never written. At any rate,
> it seemed like a good idea at the time, and there were several good
> reasons to justify that decision:
>
> First, if the user is running gfs_fsck, they must already suspect
> file system corruption. If (and this is a big if) that corruption was
> caused by recent operations on the file system, then replaying the
> journal can only serve to compound the corruption, because what is in
> the journal may itself be based on the corruption. This was more of a
> concern when GFS had bailed out and "withdrawn" from the file system
> because it detected corruption, on the suspicion that GFS itself had
> somehow caused that corruption.
>
> Second, if the user is running gfs_fsck because of corruption, we may
> not be able to assume that the journal contains good metadata worthy
> of being replayed.
>
> Third, the user always has the option of replaying the journal before
> doing gfs_fsck:
>
> 1. mount the file system after the crash (to replay the journal)
> 2. unmount the file system
> 3. run gfs_fsck
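>
> On a typical setup that sequence would look something like this (the
> /mnt/pg_fs mount point is just an example, and mounting assumes the
> node's cluster locking infrastructure is up):
>
> [root@test-db2 ~]# mount -t gfs /dev/testdb/pg_fs /mnt/pg_fs    # journals get replayed here
> [root@test-db2 ~]# umount /mnt/pg_fs
> [root@test-db2 ~]# gfs_fsck /dev/testdb/pg_fs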
>
> The decision to have gfs_fsck clear the journals was probably made
> many years ago, before gfs was stable, when these "withdraw"
> situations were more common.
>
> Some people believe that this was a bad decision. I believe that it
> makes more sense to trust the journal and replay it before doing the
> rest of the fsck operations, because in "normal" cases where a node
> dies (often for some reason unrelated to gfs, like getting shot,
> fenced, losing power, blowing up a power supply, etc.) you have the
> potential to lose metadata unless the journal is replayed.
>
> Other journaling file systems replay their journals during fsck, or
> else they inform the user, ask them to take steps to replay the
> journal (as above), or give them the option to clear it, etc. So far,
> gfs_fsck does not do that. It just clears the journals.
>
> To remedy the situation, I've got an open bugzilla, 291551 (which may
> be marked "private" because it was opened internally--sorry), at
> least for the gfs2_fsck case (gfs_fsck will likely be done too). With
> that bugzilla, I intend to either ask the user whether they want the
> journals replayed, replay them automatically, or try to detect
> problems with them.
>
> I'm not certain that this is the cause of your corruption, but it's
> the only one I can think of at the moment.

Hi Bob,

This sounds like a reasonable explanation except for one thing: the
filesystem was cleanly unmounted on both nodes before I ran gfs_fsck.
So there shouldn't be any journal to replay, right?

Anyway, I've restarted the test, and if I'm able to recreate this
error I'll first take a copy of the filesystem and then check if
running "mount + umount" makes this gfs_fsck error go away.

Regards,
Jonas
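PS. The copy + retest sequence I have in mind is roughly the following
(the /backup/pg_fs.img image path is just an example):

[root@test-db2 ~]# dd if=/dev/testdb/pg_fs of=/backup/pg_fs.img bs=1M    # raw copy to keep for analysis
[root@test-db2 ~]# mount -t gfs /dev/testdb/pg_fs /mnt/pg_fs && umount /mnt/pg_fs
[root@test-db2 ~]# gfs_fsck -n /dev/testdb/pg_fs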