Re: another seriously corrupt ext3 -- pesky journal

"Theodore Ts'o" <tytso@xxxxxxx> · Sat, 23 Aug 2003 09:58:14 -0400

On Thu, Aug 21, 2003 at 02:47:06PM -0600, Andreas Dilger wrote:
> On Aug 21, 2003  15:28 -0400, Erez Zadok wrote:
> > In message <20030821190811.GC1040@xxxxxxxxxxxxx>, Mike Fedyk writes:
> > > There's no need to support it in the kernel.  The inode number is kept in
> > > the superblock, and that's updated at mkfs and tune2fs time, not from the
> > > kernel.

Actually, there is one possible reason why we might want to have
kernel support for this --- and that's so that if the root filesystem
is corrupted in this manner, the kernel can automatically fall back to
trying to use the backup journal when it does the journal replay prior
to mounting the root filesystem.

> There are not, AFAICS, two copies of the journal being kept, which would
> require kernel changes and cause an even larger performance hit for ext3.
> 
> Instead, the journal inode number is being kept in all of the backup
> superblocks (I don't think it was in the past).  Secondly, there is a
> new "backup journal inode" (also kept in the superblock + backups),
> which I infer holds a duplicate of the blocks allocated to the journal.

The journal inode number was kept in all of the backup superblocks if
the journal was created using mke2fs or tune2fs.  There was, however,
a bug in e2fsck: when it moved the journal from /.journal to the
hidden journal inode, it didn't write the changed journal inode
number out to the backup superblocks.  That bug is fixed in the patch
that I included in my previous mail message.

> Having only the inode i_blocks field duplicated in a backup inode means
> that there is no (new) overhead writing to the journal, yet if the journal
> inode itself gets corrupted (very possible because it shares the same disk
> block with the root inode and is right at the beginning of the disk), we
> have a chance to recover the journal data.  As a result, the journal itself
> will very likely have backups of recently-written blocks and can "self heal"
> from all sorts of nasty corruptions.

Correct.  Actually, what's being backed up is the i_block[] array as
well as the i_size field.  It turns out that the i_blocks (number of
blocks) field isn't needed by e2fsck, so I didn't bother backing it
up.  Total cost to the superblock?  64 bytes.  (16 32-bit unsigned
integers.)  
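
To make the 64-byte figure concrete, here is a minimal Python sketch.
It is not the e2fsprogs code: it assumes the usual 15-entry i_block[]
array of an ext2/ext3 inode plus a 32-bit i_size, packed
little-endian, and the function names and the field order inside the
backup are made up for illustration.

```python
import struct

# Assumption: an ext2/ext3 inode has 15 block pointers in i_block[]
# (12 direct, 1 indirect, 1 double-indirect, 1 triple-indirect).
N_BLOCKS = 15

# 15 block pointers followed by i_size: 16 little-endian 32-bit
# unsigned integers.  The field order is illustrative only.
BACKUP_FORMAT = "<%dII" % N_BLOCKS


def pack_journal_backup(i_block, i_size):
    """Pack the journal inode's i_block[] and i_size into a backup blob."""
    if len(i_block) != N_BLOCKS:
        raise ValueError("expected %d block pointers" % N_BLOCKS)
    return struct.pack(BACKUP_FORMAT, *i_block, i_size)


def unpack_journal_backup(blob):
    """Recover (i_block, i_size) from a blob made by pack_journal_backup()."""
    fields = struct.unpack(BACKUP_FORMAT, blob)
    return list(fields[:N_BLOCKS]), fields[N_BLOCKS]
```

Note that i_blocks (the block count) is deliberately absent from the
packed data, matching the point above that e2fsck doesn't need it.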

> What would also be needed (not sure if this is implemented or not) is that
> in the case of a corrupt superblock e2fsck assumes "needs_recovery" is set
> if "has_journal" is set and the (backup) journal inode can be read, so that
> the journal replay is actually done.  That will almost always result in the
> primary superblock being restored from somewhere in the journal, along with
> other useful things like bitmaps and such.

Ooh, good point.  Yeah, I definitely need to do that, since if the
primary superblock is trashed, the needs_recovery flag won't be set in
the backup superblocks.  I need to think a bit to make sure there
won't be any potential lossage cases caused by attempting to replay a
journal when it's not necessary, but I don't think there are any.
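
The rule being discussed could be sketched roughly as follows.  This
is a hypothetical Python illustration, not e2fsck code: the function
and parameter names are invented, and the two flag values are assumed
to match the ext3 definitions of has_journal (a compat feature) and
needs_recovery (an incompat feature).

```python
# Assumed values, believed to match the ext3 feature-flag definitions.
COMPAT_HAS_JOURNAL = 0x0004    # "has_journal" in s_feature_compat
INCOMPAT_RECOVER = 0x0004      # "needs_recovery" in s_feature_incompat


def should_replay_journal(feature_compat, feature_incompat,
                          using_backup_superblock, journal_inode_readable):
    """Decide whether to attempt a journal replay.

    Hypothetical helper; every name here is illustrative only.
    journal_inode_readable means the primary or the backup journal
    inode information could be read.
    """
    if not feature_compat & COMPAT_HAS_JOURNAL:
        return False               # filesystem has no journal at all
    if not journal_inode_readable:
        return False               # nothing we can replay from
    if feature_incompat & INCOMPAT_RECOVER:
        return True                # the normal case: flag is set
    # The backup superblocks won't have needs_recovery set, so when
    # the primary is trashed, assume a replay is needed.
    return using_backup_superblock
```

The open question above, whether replaying a journal that didn't need
it can ever lose data, is exactly the last branch of this sketch.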

						- Ted

_______________________________________________

Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users