There are backups of all the group descriptors that can be used in such
cases, immediately following the backup superblocks. Failing that, the
group descriptors follow a very regular pattern and could be recreated
by hand if needed (e.g. if all the backups were also corrupted for some
reason).

Cheers,
Andreas

> On Jan 26, 2020, at 13:44, Jaco Kroon <jaco@xxxxxxxxx> wrote:
>
> Hi,
>
> So working through the dumpe2fs file, the group mentioned by dmesg
> contains this:
>
> Group 404160: (Blocks 13243514880-13243547647) csum 0x9546
>   Group descriptor at 13243514880
>   Block bitmap at 0 (bg #0 + 0), csum 0x00000000
>   Inode bitmap at 0 (bg #0 + 0), csum 0x00000000
>   Inode table at 0-31 (bg #0 + 0)
>   0 free blocks, 0 free inodes, 0 directories
>   Free blocks: 13243514880-13243547647
>   Free inodes: 206929921-206930432
>
> Based on that it's quite simple to see that during the array
> reconstruction we apparently wiped a bunch of data blocks with all
> zeroes. This is obviously bad. During reconstruction we had to zero
> one of the disks before we could get the array to reassemble; what
> I'm wondering now is whether that process was a good choice, and
> whether the right disk was zeroed. Obviously this implies major data
> loss (at least 4TB; probably more, since directory structures may
> well have been destroyed as well; maybe less, if some of those blocks
> weren't in use).
>
> I'm hoping that it's possible to recreate these group descriptors
> (there are a few of them) to at least point to the correct locations
> on disk, and to then attempt a cleanup with e2fsck. Again, data loss
> here is to be expected, but if we can at least limit it, that would
> be great.
>
> There are unfortunately a large number of groups affected (128 runs
> of 64 consecutive block groups).
>
> 32768 blocks/group => 128 * 64 * 32768 blocks => 268M blocks, at
> 4KB/block => 1TB of data lost. However, this is extremely
> conservative, seeing that this could include directory structures,
> with a cascading effect.
>
> Based on the pattern of the first 64 group descriptors (GDs) it looks
> like it should be possible to reconstruct the 8192 affected GDs, or
> alternatively possibly to "uninit" them
> (https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Lazy_Block_Group_Initialization).
> I'm inclined to reason that it's probably safest to repair the
> following fields in the GDs:
>
> bg_block_bitmap_{lo,hi}
> bg_inode_bitmap_{lo,hi}
> bg_inode_table_{lo,hi}
>
> I'm not sure about:
>
> bg_flags (I'm guessing the safest is to leave this zeroed).
> bg_exclude_bitmap_{lo,hi} (I don't know what this is used for).
>
> The following should (as far as my understanding goes) then be
> "fixable" by e2fsck:
>
> bg_free_blocks_count_{lo,hi}
> bg_free_inodes_count_{lo,hi}
> bg_used_dirs_count_{lo,hi}
> bg_block_bitmap_csum_{lo,hi}
> bg_inode_bitmap_csum_{lo,hi}
> bg_itable_unused_{lo,hi}
> bg_checksum
>
> And of course, it seems that tracking down the GDs on disk will be
> tricky. It seems some groups have the GD inside the group itself, and
> a bunch of others don't (nor does dumpe2fs say where exactly they
> are). There are 2048 blocks of GDs (131072, or 2^17, GDs) with every
> superblock backup; however, from group 2^17 onwards there are
> additional groups simply stating "Group descriptor at
> ${first_block_of_group}", so it's unclear how to track down the GD
> for a given block group.
> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Block_Group_Descriptors
> does not describe this particularly well either, and there seems to
> be some confusion as to how this interacts with the flex_bg and
> meta_bg features.
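>
> If I'm reading the layout document correctly, with meta_bg each
> "meta-group" of 64 groups (blocksize / descriptor size = 4096 / 64)
> stores its single descriptor block in the first block of the
> meta-group's first group (shifted by one block if that group carries
> a superblock backup), with backup copies in the second and last
> groups of the meta-group. A quick, untested sketch using the values
> from tune2fs; it does at least reproduce the "Group descriptor at
> 13243514880" that dumpe2fs printed for group 404160:
>
> /* Untested sketch, not e2fsprogs code: locate the descriptor block
>  * covering group g in the meta_bg region (g >= 2^17 on this fs). */
> #include <stdio.h>
>
> #define BLOCK_SIZE       4096ULL
> #define DESC_SIZE        64ULL
> #define BLOCKS_PER_GROUP 32768ULL
> #define DESCS_PER_BLOCK  (BLOCK_SIZE / DESC_SIZE)  /* 64 groups per meta-group */
>
> /* sparse_super: only groups 0, 1 and powers of 3, 5 and 7 hold a
>  * superblock backup, which pushes the GD block one block further in. */
> static int has_super(unsigned long long g)
> {
>     unsigned long long n, p[3] = {3, 5, 7};
>     int i;
>     if (g == 0 || g == 1)
>         return 1;
>     for (i = 0; i < 3; i++) {
>         for (n = p[i]; n < g; n *= p[i])
>             ;
>         if (n == g)
>             return 1;
>     }
>     return 0;
> }
>
> int main(void)
> {
>     unsigned long long g = 404160;  /* the group from dmesg */
>     unsigned long long first = (g / DESCS_PER_BLOCK) * DESCS_PER_BLOCK;
>     /* primary copy in the 1st group of the meta-group, backups in
>      * the 2nd and last */
>     unsigned long long c[3] = { first, first + 1,
>                                 first + DESCS_PER_BLOCK - 1 };
>     int i;
>     for (i = 0; i < 3; i++)
>         printf("copy %d: block %llu\n", i,
>                c[i] * BLOCKS_PER_GROUP + (has_super(c[i]) ? 1 : 0));
>     return 0;
> }
>
> If that's right, the backup descriptor copies for these zeroed
> meta-groups may still be intact, which would make reconstruction a
> matter of copying them back rather than recomputing anything.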
>
> I do have an LVM snapshot of the affected LV currently, so I'm happy
> to try things.
>
> Kind Regards,
> Jaco
>
>> On 2020/01/26 12:21, Jaco Kroon wrote:
>>
>> Hi,
>>
>> I've got an 85TB ext4 filesystem which I'm unable to fsck. The only
>> cases of the same error I could find were, as far as I can tell, due
>> to an SD card "swallowing" writes (i.e. the card goes into a
>> read-only mode but doesn't report write failures).
>>
>> crowsnest ~ # e2fsck -f /dev/lvm/home
>> e2fsck 1.45.4 (23-Sep-2019)
>> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
>> e2fsck: Group descriptors look bad... trying backup blocks...
>> /dev/lvm/home: recovering journal
>> e2fsck: unable to set superblock flags on /dev/lvm/home
>>
>> /dev/lvm/home: ***** FILE SYSTEM WAS MODIFIED *****
>> /dev/lvm/home: ********** WARNING: Filesystem still has errors **********
>>
>> I have also (using dumpe2fs) obtained the locations of the backup
>> superblocks and tried the same against a few other superblocks using
>> -b. -y (as per a suggestion from at least one post) makes absolutely
>> no difference; our understanding is that this simply answers yes to
>> all questions, so we didn't expect it to have an impact, but decided
>> it was worth a try anyway.
>>
>> Looking at the code for the "unable to set superblock flags" error,
>> it looks like the relevant code is in e2fsck/unix.c, specifically
>> this:
>>
>> 1765    if (ext2fs_has_feature_journal_needs_recovery(sb)) {
>> 1766            if (ctx->options & E2F_OPT_READONLY) {
>> ...
>> 1771            } else {
>> 1772                    if (ctx->flags & E2F_FLAG_RESTARTED) {
>> 1773                            /*
>> 1774                             * Whoops, we attempted to run the
>> 1775                             * journal twice.  This should never
>> 1776                             * happen, unless the hardware or
>> 1777                             * device driver is being bogus.
>> 1778                             */
>> 1779                            com_err(ctx->program_name, 0,
>> 1780                                    _("unable to set superblock flags "
>> 1781                                      "on %s\n"), ctx->device_name);
>> 1782                            fatal_error(ctx, 0);
>> 1783                    }
>>
>> That comment has me somewhat confused. I'm assuming the implication
>> is that e2fsck tried to update the superblock, but after reading it
>> back it's either unchanged or still wrong (in line with the
>> description of the SD card failure I found online). None of our
>> arrays are showing R/O in /proc/mdstat.
>>
>> We did pick out the following during kernel bootup. (We downgraded
>> back to 5.1.15, which we're on currently, after experiencing major
>> performance issues on 5.3.6 which 5.4.8 subsequently didn't seem to
>> fix; the 4.14.13 kernel that was used previously is known to cause
>> ext4 corruption of the kind we saw on the other filesystems.)
>>
>> [ 3932.271538] EXT4-fs (dm-7): ext4_check_descriptors: Block bitmap for group 404160 overlaps superblock
>> [ 3932.271539] EXT4-fs (dm-7): group descriptors corrupted!
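>>
>> For what it's worth, that first message appears to come from
>> ext4_check_descriptors() in fs/ext4/super.c: with the descriptor
>> zeroed, the block bitmap location reads back as block 0, which is
>> exactly where the primary superblock sits on a 4K-block filesystem.
>> Paraphrased from memory (so not verbatim kernel source), the check
>> is roughly:
>>
>> block_bitmap = ext4_block_bitmap(sb, gdp);  /* 0 for our zeroed GDs */
>> if (block_bitmap == sb_block) {             /* sb_block is 0 at 4K blocksize */
>>         ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
>>                  "Block bitmap for group %u overlaps superblock", i);
>>         if (!sb_rdonly(sb))
>>                 return 0;  /* caller then logs "group descriptors corrupted!" */
>> }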
>>
>> I created a dumpe2fs file as well:
>>
>> crowsnest ~ # dumpe2fs /dev/lvm/home > /var/tmp/dump2fs_home.txt
>> dumpe2fs 1.45.4 (23-Sep-2019)
>> dumpe2fs: Block bitmap checksum does not match bitmap while trying to
>> read '/dev/lvm/home' bitmaps
>>
>> Available at https://downloads.uls.co.za/85T/dump2fs_home.txt.xz
>> (1.2GB, md5:79b3250e209c067af2532d5324ff95aa, around 12GB extracted).
>>
>> A strace of e2fsck -y -f /dev/lvm/home is at
>> https://downloads.uls.co.za/85T/fsck.strace.txt (13MB,
>> md5:60aa91b0c47dd2837260218eb774152d).
>>
>> crowsnest ~ # tune2fs -l /dev/lvm/home
>> tune2fs 1.45.4 (23-Sep-2019)
>> Filesystem volume name:   <none>
>> Last mounted on:          /home
>> Filesystem UUID:          522a9faf-7992-4888-93d5-7fe49a9762d6
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:    1 (dynamic)
>> Filesystem features:      has_journal ext_attr filetype meta_bg extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
>> Filesystem flags:         signed_directory_hash
>> Default mount options:    user_xattr acl
>> Filesystem state:         clean
>> Errors behavior:          Continue
>> Filesystem OS type:       Linux
>> Inode count:              356515840
>> Block count:              22817013760
>> Reserved block count:     0
>> Free blocks:              6874204745
>> Free inodes:              202183498
>> First block:              0
>> Block size:               4096
>> Fragment size:            4096
>> Group descriptor size:    64
>> Blocks per group:         32768
>> Fragments per group:      32768
>> Inodes per group:         512
>> Inode blocks per group:   32
>> RAID stride:              128
>> RAID stripe width:        1024
>> First meta block group:   2048
>> Flex block group size:    16
>> Filesystem created:       Thu Jul 26 12:19:07 2018
>> Last mount time:          Sat Jan 18 18:58:50 2020
>> Last write time:          Sun Jan 26 11:38:56 2020
>> Mount count:              2
>> Maximum mount count:      -1
>> Last checked:             Wed Oct 30 17:37:27 2019
>> Check interval:           0 (<none>)
>> Lifetime writes:          976 TB
>> Reserved blocks uid:      0 (user root)
>> Reserved blocks gid:      0 (group root)
>> First inode:              11
>> Inode size:               256
>> Required extra isize:     32
>> Desired extra isize:      32
>> Journal inode:            8
>> Default directory hash:   half_md4
>> Directory Hash Seed:      876a7d14-bce8-4bef-9569-82e7d573b7aa
>> Journal backup:           inode blocks
>> Checksum type:            crc32c
>> Checksum:                 0xfbd895e9
>>
>> Infrastructure: 3 x RAID6 arrays, 2 of 12 x 4TB disks, and 1 of 4 x
>> 10TB disks (100TB usable total). These are combined into a single VG
>> using LVM, and then carved up into a number of LVs, the largest of
>> which is this 85TB chunk. We have tried in the past to carve this
>> into smaller LVs but failed, so we're aware that this is very large
>> and not ideal.
>>
>> We did experience an assembly issue on one of the underlying RAID6
>> PVs; that has been resolved, and the disk that was giving issues has
>> been scrubbed and rebuilt. From what we can tell based on other
>> filesystems, this did not affect data integrity, but we can't make
>> that statement with 100% certainty; as such we are expecting some
>> data loss here, but it would be better if we can recover at least
>> some of this data.
>>
>> Other filesystems which also reside on the same PV that was affected
>> by the RAID6 problem either received a clean bill of health, or were
>> successfully repaired by e2fsck (the system did crash, however; it's
>> unclear whether the RAID6 assembly problem was the cause or merely
>> another consequence, and as a result, whether the corruption on the
>> repaired filesystems was a consequence of the kernel or the RAID).
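>>
>> In case it's useful to anyone following along, the sort of commands
>> we've been experimenting with (untested beyond what's described
>> above; BITMAP_BLOCK below is a placeholder, not a known-good value,
>> and would have to come from the backup descriptors or the flex_bg
>> pattern):
>>
>> # read the fs via the first backup superblock
>> # (4K blocksize => backup superblock in group 1, block 32768)
>> dumpe2fs -o superblock=32768 -o blocksize=4096 /dev/lvm/home | head
>>
>> # the e2fsck equivalent
>> e2fsck -b 32768 -B 4096 /dev/lvm/home
>>
>> # patching a single descriptor field by hand: -w opens read-write,
>> # -c ("catastrophic" mode) skips reading the corrupt bitmaps
>> debugfs -w -c -R "set_bg 404160 block_bitmap $BITMAP_BLOCK" /dev/lvm/home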
>>
>> I'm continuing onwards with the e2fsck code to try and figure this
>> out, but I'm hopeful that someone could perhaps provide some
>> much-needed insight and pointers.
>>
>> Kind Regards,
>> Jaco