Re: fsck infinite loop on corrupt ext4 file system

Theodore Tso <tytso@xxxxxxx> · Tue, 18 Aug 2009 12:01:55 -0400

On Mon, Aug 17, 2009 at 06:10:22PM -0700, Frank Mayhar wrote:
> It's clear that fsck is neither correcting the block groups nor is it
> detecting the bad entries properly (a sanity check might be in order
> here).  It's not even noticing that it's looping, it just keeps failing
> the allocation and retrying.  While it may be that fsck can't recover
> the file system in this case, it should at least notice and abort.
> 
> My thinking is that the location of the inode tables should be invariant
> over the life of the file system.  Certainly there's no place in ext4
> itself that changes those fields (that I can see, anyway).  Why couldn't
> fsck compute the proper values and compare those against what's there?

So there are a couple of things going on here.  The first is that the
code which tries to allocate new inode/block allocation bitmaps or
inode tables wasn't taught about the FLEX_BG feature.  On such
filesystems the metadata should be located at the beginning of the
flex-blockgroup, but if we can't find space for it there (allocating
the inode table is tricky since it can require up to a few hundred
contiguous free blocks), we should try to find the space anywhere in
the filesystem.  If we can't find the space at all, we should indeed
abort.  Please find attached a patch which should fix e2fsck to handle
this case correctly.  Could you test it and let me know if it works
correctly?
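
To make the flex_bg search range concrete, here is a minimal standalone
sketch of the arithmetic only; it is not the e2fsck code itself.  The
struct and the function flex_bg_range() are hypothetical names made up
for this illustration.  It assumes the flex_bg size is a power of two
(1 << s_log_groups_per_flex) and that the last valid group index is
group_desc_count - 1.

```c
#include <assert.h>

/* Hypothetical type for this sketch: the first and last block group
 * of the flex_bg that contains a given group. */
struct flex_range {
	unsigned first_grp;
	unsigned last_grp;
};

/*
 * Hypothetical helper, not part of libext2fs.  Computes the range of
 * block groups to search first when relocating metadata for `group`.
 * Assumes the flex_bg size is a power of two.
 */
static struct flex_range flex_bg_range(unsigned group,
				       unsigned log_groups_per_flex,
				       unsigned group_desc_count)
{
	unsigned flexbg_size = 1U << log_groups_per_flex;
	struct flex_range r;

	assert(group_desc_count > 0 && group < group_desc_count);

	/* Round down to the first group of the flex_bg. */
	r.first_grp = group & ~(flexbg_size - 1);
	/* Round up to the last group of the flex_bg. */
	r.last_grp = group | (flexbg_size - 1);
	/* The final flex_bg may be short; clamp to the last real group. */
	if (r.last_grp >= group_desc_count)
		r.last_grp = group_desc_count - 1;
	return r;
}
```

For example, with 16 groups per flex_bg (log value 4) and 40 groups in
total, flex_bg_range(37, 4, 40) yields groups 32..39.  The first
allocation attempt would cover that range, and only if it fails would
the search widen to the whole filesystem.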

As for assuming that the inode tables are invariant over the life of
the filesystem: this is normally true, but inode tables can be located
in places other than the default.  For example, if there are bad
blocks where the inode tables should be, then the inode tables get
pushed to non-standard locations.  So this makes calculating where the
inode table "should" be a little tricky, especially since the list of
bad blocks can change after the filesystem is formatted.

In addition, e2fsck tries very hard not to destroy data, and so there
is the question of what to do if there are data blocks located where
the inode table "should" be.  In theory e2fsck should be able to move
those data blocks elsewhere, or if there is no space, potentially
offer to delete a user file to make room for the inode table; after
all, better to sacrifice one or two data files than to lose
potentially several hundred or thousand files.  But this is a level of
complexity that I never had a chance to add to e2fsck, and in truth
the case where we run into this level of lossage is very rare.

After all, we keep many backup copies of the block group descriptors,
and the backup group descriptors are rarely written, so this level of
corruption should be quite rare.  Making e2fsck smarter to deal with
the most extreme cases of loss is therefore desirable, but it's always
been a "nice to have".
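
As a rough illustration of how many descriptor copies exist, the
sketch below encodes the sparse_super placement rule: copies live in
groups 0 and 1 and in groups that are powers of 3, 5, or 7.  The
function names are hypothetical, and the sketch assumes sparse_super is
enabled and meta_bg is not; other layouts place the copies differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if n equals base^k for some k >= 1. */
static bool is_power_of(unsigned n, unsigned base)
{
	if (n < base)
		return false;
	while (n % base == 0)
		n /= base;
	return n == 1;
}

/*
 * Hypothetical helper for this sketch: does `group` hold a copy of
 * the superblock and group descriptors?  Assumes sparse_super and no
 * meta_bg.  Group 0 holds the primary copy; the rest are backups.
 */
static bool group_has_descriptor_copy(unsigned group)
{
	if (group <= 1)
		return true;
	return is_power_of(group, 3) ||
	       is_power_of(group, 5) ||
	       is_power_of(group, 7);
}
```

Under these assumptions, groups 0, 1, 3, 5, 7, 9, 25, 27, 49, ... carry
a copy, so even a moderately sized filesystem has several backups to
fall back on.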

In any case, with ext4 and the flex_bg feature, the ability to
allocate the inode table anywhere in the filesystem should make the
really complex recovery code even more rarely required.

Please try this patch and see if it fixes things up for you or not.

Thanks!!

						- Ted

diff --git a/e2fsck/pass1.c b/e2fsck/pass1.c
index 518c2ff..203468b 100644
--- a/e2fsck/pass1.c
+++ b/e2fsck/pass1.c
@@ -2376,9 +2376,10 @@ static void new_table_block(e2fsck_t ctx, blk_t first_block, int group,
 			    const char *name, int num, blk_t *new_block)
 {
 	ext2_filsys fs = ctx->fs;
+	dgrp_t		last_grp;
 	blk_t		old_block = *new_block;
 	blk_t		last_block;
-	int		i;
+	int		i, is_flexbg, flexbg, flexbg_size;
 	char		*buf;
 	struct problem_context	pctx;
 
@@ -2388,19 +2389,44 @@ static void new_table_block(e2fsck_t ctx, blk_t first_block, int group,
 	pctx.blk = old_block;
 	pctx.str = name;
 
-	last_block = ext2fs_group_last_block(fs, group);
+	/*
+	 * For flex_bg filesystems, first try to allocate the metadata
+	 * within the flex_bg, and if that fails then try finding the
+	 * space anywhere in the filesystem.
+	 */
+	is_flexbg = EXT2_HAS_INCOMPAT_FEATURE(fs->super,
+					      EXT4_FEATURE_INCOMPAT_FLEX_BG);
+	if (is_flexbg) {
+		flexbg_size = 1 << fs->super->s_log_groups_per_flex;
+		flexbg = group / flexbg_size;
+		first_block = ext2fs_group_first_block(fs,
+						       flexbg_size * flexbg);
+		last_grp = group | (flexbg_size - 1);
+		if (last_grp >= fs->group_desc_count)
+			last_grp = fs->group_desc_count - 1;
+		last_block = ext2fs_group_last_block(fs, last_grp);
+	} else
+		last_block = ext2fs_group_last_block(fs, group);
 	pctx.errcode = ext2fs_get_free_blocks(fs, first_block, last_block,
-					num, ctx->block_found_map, new_block);
+					      num, ctx->block_found_map,
+					      new_block);
+	if (is_flexbg && (pctx.errcode == EXT2_ET_BLOCK_ALLOC_FAIL))
+		pctx.errcode = ext2fs_get_free_blocks(fs,
+				fs->super->s_first_data_block,
+				fs->super->s_blocks_count,
+				num, ctx->block_found_map, new_block);
 	if (pctx.errcode) {
 		pctx.num = num;
 		fix_problem(ctx, PR_1_RELOC_BLOCK_ALLOCATE, &pctx);
 		ext2fs_unmark_valid(fs);
+		ctx->flags |= E2F_FLAG_ABORT;
 		return;
 	}
 	pctx.errcode = ext2fs_get_mem(fs->blocksize, &buf);
 	if (pctx.errcode) {
 		fix_problem(ctx, PR_1_RELOC_MEMORY_ALLOCATE, &pctx);
 		ext2fs_unmark_valid(fs);
+		ctx->flags |= E2F_FLAG_ABORT;
 		return;
 	}
 	ext2fs_mark_super_dirty(fs);