Hi,

This is related to https://www.spinics.net/lists/linux-ext4/msg61075.html
(and is possibly what prompted Jan's work in that patch series).

I have a 64TB (exactly) filesystem:

Filesystem OS type:       Linux
Inode count:              4294967295
Block count:              17179869184
Reserved block count:     689862348
Free blocks:              16910075355
Free inodes:              4294966285
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              128
RAID stripe width:        128
First meta block group:   1152
Flex block group size:    16

Note that the Inode count above is 2^32-1 instead of the expected 2^32.
The correct inode count is exactly 2^32 (524288 groups * 8192 inodes per
group), which overflows a 32-bit field to 0. A kernel bug (fixed by Jan)
allowed this overflow in the first place.

I'm busy writing a patch for e2fsck that (on top of the referenced series
by Jan) would at least let fsck clear the filesystem of other errors.
Currently, if I hack the inode count to ~0U, fsck, tune2fs and friends
fail.

With the attached patch (sorry, Thunderbird breaks my inlining of patches)
tune2fs operates (-l at least) as expected, and fsck gets to pass 5, where
it segfaults with the following stack trace (compiled with -O0):

/dev/exp/exp contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

Program received signal SIGSEGV, Segmentation fault.
0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90, group=552320, bg_flag=1) at blknum.c:445
445             return gdp->bg_flags & bg_flag;
(gdb) bt
#0  0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90, group=552320, bg_flag=1) at blknum.c:445
#1  0x000055555558c343 in check_inode_bitmaps (ctx=0x5555558112b0) at pass5.c:759
#2  0x000055555558a251 in e2fsck_pass5 (ctx=0x5555558112b0) at pass5.c:57
#3  0x000055555556fb48 in e2fsck_run (ctx=0x5555558112b0) at e2fsck.c:249
#4  0x000055555556e849 in main (argc=5, argv=0x7fffffffdfe8) at unix.c:1859
(gdb) print *gdp
$1 = {bg_block_bitmap = 528400, bg_inode_bitmap = 0, bg_inode_table = 528456,
  bg_free_blocks_count = 0, bg_free_inodes_count = 0, bg_used_dirs_count = 4000,
  bg_flags = 8, bg_exclude_bitmap_lo = 0, bg_block_bitmap_csum_lo = 0,
  bg_inode_bitmap_csum_lo = 8, bg_itable_unused = 0, bg_checksum = 0,
  bg_block_bitmap_hi = 528344, bg_inode_bitmap_hi = 0, bg_inode_table_hi = 528512,
  bg_free_blocks_count_hi = 0, bg_free_inodes_count_hi = 0,
  bg_used_dirs_count_hi = 4280, bg_itable_unused_hi = 8, bg_exclude_bitmap_hi = 0,
  bg_block_bitmap_csum_hi = 0, bg_inode_bitmap_csum_hi = 0, bg_reserved = 0}

... so I'm not sure why it even segfaults. gdb can retrieve a value of 8
for bg_flags, and yet when the code does the same thing it segfaults. I am
probably misunderstanding what is going wrong, but the only thing I can
see that could segfault is the gdp dereference, and that seems to be a
valid pointer.

I am not sure whether this is a separate issue or a result of my tampering
with the inode counter the way I am (I have to assume the latter).

For testing I created a thin volume (1TB) in a separate environment, where
I initially created a 16TB filesystem and then expanded it to 64TB. This
resulted in exactly the same symptoms we saw in the production
environment. I created a thousand empty files in the root folder. The
filesystem currently consumes 100GB on disk in the thin volume.
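For reference, the arithmetic behind the inode count above can be sketched
in C. This is only an illustration: the helper names are made up for this
mail and are not part of e2fsprogs or the kernel, and I assume a first
data block of 0 as in the dump above.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only (not e2fsprogs API). */

/* Number of block groups, rounding up for a partial last group.
 * Assumes the first data block is 0. */
static uint64_t group_count(uint64_t blocks, uint32_t blocks_per_group)
{
	return (blocks + blocks_per_group - 1) / blocks_per_group;
}

/* Inode count implied by the geometry, computed in 64 bits. */
static uint64_t true_inode_count(uint64_t groups, uint32_t inodes_per_group)
{
	return groups * inodes_per_group;
}

/* What a 32-bit field ends up holding if the count is stored unchecked. */
static uint32_t wrapped_inode_count(uint64_t count)
{
	return (uint32_t)count;
}

/* Saturate at ~0U instead of wrapping. */
static uint32_t clamped_inode_count(uint64_t count)
{
	return count > UINT32_MAX ? UINT32_MAX : (uint32_t)count;
}
```

With 17179869184 blocks, 32768 blocks per group and 8192 inodes per group,
group_count() gives 524288 groups and true_inode_count() gives 2^32, which
wraps to 0 in a 32-bit field and clamps to 4294967295.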
Note that group=552320 > 524288 (17179869184 / 32768).

Regarding further expansion, I would appreciate some advice. There are two
(three) possible options that I could come up with:

1. Find a way to reduce the number of inodes per group (say to 4096,
   which would require re-allocating all inodes >= 2^31 to inodes < 2^31).

2. Allow additional blocks to be added to the filesystem without adding
   additional inodes.

(3. Find some free space, create a new filesystem, and iteratively move
   data from the one to the other, shrinking and growing the filesystems
   as progress allows. I will never be able to move more data at a time
   than what is currently available on the system, around 4TB in my case,
   so this will take a VERY long time.)

I'm currently aiming for option 2 since that looks to be the simplest:
simply allow the overflow to happen, but don't allocate additional inodes
if the number of inodes is already ~0U.

Kind Regards,
Jaco
From b9fb5efebee024a53656fe063bf5a0ccf349401c Mon Sep 17 00:00:00 2001
From: Jaco Kroon <jaco@xxxxxxxxx>
Date: Tue, 24 Jul 2018 13:18:07 +0200
Subject: [PATCH] Allow opening a filesystem with maxed out inode count.

In some scenarios if we resized a filesystem over the limit it could
happen that inode count wrapped over, typically to 0.

This change would enable us to set the number of inodes on the
filesystem to 2^32-1 in order to enable continuing to use the
filesystem.

This change is required to allow fsck to "clear" the filesystem in order
to operate on it.
---
 e2fsck/super.c      | 5 +----
 lib/ext2fs/openfs.c | 5 +++--
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/e2fsck/super.c b/e2fsck/super.c
index eb7ab0d1..3f1a219f 100644
--- a/e2fsck/super.c
+++ b/e2fsck/super.c
@@ -665,10 +665,7 @@ void check_super_block(e2fsck_t ctx)
 
 	should_be = (__u64)sb->s_inodes_per_group * fs->group_desc_count;
 	if (should_be > ~0U) {
-		pctx.num = should_be;
-		fix_problem(ctx, PR_0_INODE_COUNT_BIG, &pctx);
-		ctx->flags |= E2F_FLAG_ABORT;
-		return;
+		should_be = ~0U;
 	}
 	if (sb->s_inodes_count != should_be) {
 		pctx.ino = sb->s_inodes_count;
diff --git a/lib/ext2fs/openfs.c b/lib/ext2fs/openfs.c
index 85d73e2a..a7a343f9 100644
--- a/lib/ext2fs/openfs.c
+++ b/lib/ext2fs/openfs.c
@@ -129,6 +129,7 @@ errcode_t ext2fs_open2(const char *name, const char *io_options,
 	int		group_zero_adjust = 0;
 	unsigned int	inode_size;
 	__u64		groups_cnt;
+	__u64		inode_cnt;
 #ifdef WORDS_BIGENDIAN
 	unsigned int	groups_per_block;
 	struct ext2_group_desc *gdp;
@@ -386,9 +387,9 @@ errcode_t ext2fs_open2(const char *name, const char *io_options,
 		goto cleanup;
 	}
 	fs->group_desc_count = groups_cnt;
+	inode_cnt = (__u64)fs->group_desc_count * EXT2_INODES_PER_GROUP(fs->super);
 	if (!(flags & EXT2_FLAG_IGNORE_SB_ERRORS) &&
-	    (__u64)fs->group_desc_count * EXT2_INODES_PER_GROUP(fs->super) !=
-	    fs->super->s_inodes_count) {
+	    (inode_cnt < (1ULL<<32) ? inode_cnt : ~0U) != fs->super->s_inodes_count) {
 		retval = EXT2_ET_CORRUPT_SUPERBLOCK;
 		goto cleanup;
 	}
--
2.16.4