allowing ext4 file systems that wrapped inode count to continue working

Jaco Kroon <jaco@xxxxxxxxx> · Tue, 24 Jul 2018 17:00:04 +0200

Hi,

Related to https://www.spinics.net/lists/linux-ext4/msg61075.html (and
possibly the cause of the work from Jan in that patch series).

I have a 64TB (exactly) filesystem.

Filesystem OS type:       Linux
Inode count:              4294967295
Block count:              17179869184
Reserved block count:     689862348
Free blocks:              16910075355
Free inodes:              4294966285
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              128
RAID stripe width:        128
First meta block group:   1152
Flex block group size:    16

Note that in the above, Inode count == 2^32-1 instead of the expected 2^32.

The correct inode count is exactly 2^32 (524288 groups * 8192 inodes per
group), which overflows the 32-bit field to 0.  A kernel bug (fixed by Jan)
allowed this overflow in the first place.
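
The overflow can be reproduced from the superblock values listed above.  A
minimal sketch in C, assuming the first block is 0 (as it is here) so that
the group count is just the block count divided by blocks per group, rounded
up.  The helper names are made up for illustration and are not e2fsprogs
functions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of block groups, rounding up.
 * Assumes the first data block is 0. */
static uint64_t group_count(uint64_t blocks, uint32_t blocks_per_group)
{
	assert(blocks_per_group != 0);
	return (blocks + blocks_per_group - 1) / blocks_per_group;
}

/* Inode count computed in 64 bits, so the product itself cannot wrap
 * for realistic values. */
static uint64_t inode_count_64(uint64_t groups, uint32_t inodes_per_group)
{
	return groups * inodes_per_group;
}

/* What survives in a 32-bit field such as s_inodes_count:
 * only the low 32 bits of the product. */
static uint32_t inode_count_stored(uint64_t groups, uint32_t inodes_per_group)
{
	return (uint32_t)inode_count_64(groups, inodes_per_group);
}
```

With the numbers above, group_count(17179869184, 32768) is 524288,
inode_count_64(524288, 8192) is 2^32, and inode_count_stored(524288, 8192)
is 0.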

I'm trying to write a patch (on top of the referenced series by Jan) that
would allow e2fsck to at least clear the filesystem of other errors.
Currently, if I hack the inode count to ~0U, fsck, tune2fs and friends
fail.

With the attached patch (sorry, Thunderbird breaks my inlining of
patches) tune2fs operates (-l at least) as expected, and fsck gets to
pass5 where it segfaults with the following stack trace (compiled with -O0):

/dev/exp/exp contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

Program received signal SIGSEGV, Segmentation fault.
0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90, group=552320, bg_flag=1)
    at blknum.c:445
445             return gdp->bg_flags & bg_flag;
(gdb) bt
#0  0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90, group=552320, bg_flag=1)
    at blknum.c:445
#1  0x000055555558c343 in check_inode_bitmaps (ctx=0x5555558112b0) at pass5.c:759
#2  0x000055555558a251 in e2fsck_pass5 (ctx=0x5555558112b0) at pass5.c:57
#3  0x000055555556fb48 in e2fsck_run (ctx=0x5555558112b0) at e2fsck.c:249
#4  0x000055555556e849 in main (argc=5, argv=0x7fffffffdfe8) at unix.c:1859
(gdb) print *gdp
$1 = {bg_block_bitmap = 528400, bg_inode_bitmap = 0, bg_inode_table = 528456,
  bg_free_blocks_count = 0, bg_free_inodes_count = 0, bg_used_dirs_count = 4000, bg_flags = 8,
  bg_exclude_bitmap_lo = 0, bg_block_bitmap_csum_lo = 0, bg_inode_bitmap_csum_lo = 8,
  bg_itable_unused = 0, bg_checksum = 0, bg_block_bitmap_hi = 528344, bg_inode_bitmap_hi = 0,
  bg_inode_table_hi = 528512, bg_free_blocks_count_hi = 0, bg_free_inodes_count_hi = 0,
  bg_used_dirs_count_hi = 4280, bg_itable_unused_hi = 8, bg_exclude_bitmap_hi = 0,
  bg_block_bitmap_csum_hi = 0, bg_inode_bitmap_csum_hi = 0, bg_reserved = 0}

... so I'm not sure why it even segfaults.  gdb can retrieve a value of
8 for bg_flags ... and yet when the code does the same thing it segfaults.
I'm probably misunderstanding what's going wrong, but the only thing I can
see that can segfault is the gdp dereference, and that seems to be a valid
pointer ...

I am not sure if this is a separate issue, or due to me tampering with
the inode counter in the way that I am (I have to assume the latter).
For testing I created a thin volume (1TB) in a separate environment,
where I created a 16TB filesystem initially, and then expanded that to
64TB, resulting in exactly the same symptoms we saw in the production
environment.  I created a thousand empty files in the root folder.  The
filesystem is currently consuming 100GB on-disk in the thin volume.
Note that group=552320 is larger than the total number of groups, which
is 524288 (17179869184 / 32768).
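
To spell out that comparison: a rough sketch in C of the range check, with
made-up helper names (not taken from e2fsprogs).  It only shows that
group 552320 lies past the last valid group and that even the highest
possible inode number maps to a valid group; it does not explain how pass 5
arrived at that group number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: zero-based group that holds inode number ino.
 * Inode numbers start at 1. */
static uint64_t ino_to_group(uint32_t ino, uint32_t inodes_per_group)
{
	assert(ino != 0 && inodes_per_group != 0);
	return ((uint64_t)ino - 1) / inodes_per_group;
}

/* Hypothetical helper: non-zero if group is a valid descriptor index
 * for a filesystem with group_desc_count groups. */
static int group_is_valid(uint64_t group, uint64_t group_desc_count)
{
	return group < group_desc_count;
}
```

For this filesystem, group_is_valid(552320, 524288) is false, while
ino_to_group(4294967295, 8192) is 524287, which is still in range.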

Regarding further expansion, I would appreciate some advice.  There are two
(three) possible options that I could come up with:

1.  Find a way to reduce the number of inodes per group (say to 4096,
which would require re-allocating all inodes >= 2^31 to inodes <2^31).

2.  Allow to add additional blocks to the filesystem, without adding
additional inodes.

(3. Find some free space, create a new filesystem, and iteratively move
data from the one to the other, shrinking and growing the filesystems as
per progress - will never be able to move more data than what is
currently available on the system, around 4TB in my case, so will take a
VERY long time).

I'm currently aiming for option 2 since that looks to be the simplest. 
Simply allow overflow to happen, but don't allocate additional inodes if
number of inodes is already ~0U.
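
A rough sketch of what option 2 could look like, in C.  This is only an
illustration of the clamping idea, not code from the kernel or e2fsprogs;
the function names are hypothetical, and it assumes unsigned int is 32 bits
so that ~0U equals UINT32_MAX.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: inode count to store after growing to new_groups groups.
 * Saturates at UINT32_MAX (~0U) instead of wrapping to 0. */
static uint32_t clamped_inode_count(uint64_t new_groups,
				    uint32_t inodes_per_group)
{
	uint64_t want = new_groups * inodes_per_group;

	return want > UINT32_MAX ? UINT32_MAX : (uint32_t)want;
}

/* Hypothetical: how many inodes of a given group fit below the clamp.
 * Group g would hold inode numbers g*ipg + 1 .. (g + 1)*ipg. */
static uint32_t usable_inodes_in_group(uint64_t group,
				       uint32_t inodes_per_group)
{
	uint64_t first, left;

	assert(inodes_per_group != 0);
	first = group * inodes_per_group; /* inodes in all earlier groups */
	if (first >= UINT32_MAX)
		return 0;                 /* group adds blocks only */
	left = UINT32_MAX - first;
	return left < inodes_per_group ? (uint32_t)left : inodes_per_group;
}
```

With 8192 inodes per group, the last group of the current filesystem
(524287) would expose 8191 inodes, and any group from 524288 onwards would
expose none.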

Kind Regards,
Jaco

From b9fb5efebee024a53656fe063bf5a0ccf349401c Mon Sep 17 00:00:00 2001
From: Jaco Kroon <jaco@xxxxxxxxx>
Date: Tue, 24 Jul 2018 13:18:07 +0200
Subject: [PATCH] Allow opening a filesystem with maxed out inode count.

In some scenarios if we resized a filesystem over the limit it could
happen that inode count wrapped over, typically to 0.

This change would enable us to set the number of inodes on the
filesystem to 2^32-1 in order to enable continuing to use the
filesystem.

This change is required to allow fsck to "clear" the filesystem in order
to operate on it.
---
 e2fsck/super.c      | 5 +----
 lib/ext2fs/openfs.c | 5 +++--
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/e2fsck/super.c b/e2fsck/super.c
index eb7ab0d1..3f1a219f 100644
--- a/e2fsck/super.c
+++ b/e2fsck/super.c
@@ -665,10 +665,7 @@ void check_super_block(e2fsck_t ctx)
 
 	should_be = (__u64)sb->s_inodes_per_group * fs->group_desc_count;
 	if (should_be > ~0U) {
-		pctx.num = should_be;
-		fix_problem(ctx, PR_0_INODE_COUNT_BIG, &pctx);
-		ctx->flags |= E2F_FLAG_ABORT;
-		return;
+		should_be = ~0U;
 	}
 	if (sb->s_inodes_count != should_be) {
 		pctx.ino = sb->s_inodes_count;
diff --git a/lib/ext2fs/openfs.c b/lib/ext2fs/openfs.c
index 85d73e2a..a7a343f9 100644
--- a/lib/ext2fs/openfs.c
+++ b/lib/ext2fs/openfs.c
@@ -129,6 +129,7 @@ errcode_t ext2fs_open2(const char *name, const char *io_options,
 	int		group_zero_adjust = 0;
 	unsigned int	inode_size;
 	__u64		groups_cnt;
+	__u64		inode_cnt;
 #ifdef WORDS_BIGENDIAN
 	unsigned int	groups_per_block;
 	struct ext2_group_desc *gdp;
@@ -386,9 +387,9 @@ errcode_t ext2fs_open2(const char *name, const char *io_options,
 		goto cleanup;
 	}
 	fs->group_desc_count = 	groups_cnt;
+	inode_cnt = (__u64)fs->group_desc_count * EXT2_INODES_PER_GROUP(fs->super);
 	if (!(flags & EXT2_FLAG_IGNORE_SB_ERRORS) &&
-	    (__u64)fs->group_desc_count * EXT2_INODES_PER_GROUP(fs->super) !=
-	    fs->super->s_inodes_count) {
+	    (inode_cnt < (1ULL<<32) ? inode_cnt : ~0U) != fs->super->s_inodes_count) {
 		retval = EXT2_ET_CORRUPT_SUPERBLOCK;
 		goto cleanup;
 	}
-- 
2.16.4