Re: [PATCH 1/2] ext4: only defer sb update on error if SB_ACTIVE

On 2025/2/27 19:43, Ojaswin Mujoo wrote:
On Wed, Feb 26, 2025 at 09:55:52AM +0800, Baokun Li wrote:
On 2025/2/25 20:06, Ojaswin Mujoo wrote:
On Tue, Feb 25, 2025 at 09:53:10AM +0800, Baokun Li wrote:
On 2025/2/22 16:40, Ojaswin Mujoo wrote:
Presently we always BUG_ON if trying to start a transaction on a journal
marked with JBD2_UNMOUNT, since this should never happen. However, while
running stress tests it was observed that in some error handling paths it
is possible for update_super_work to start a transaction after the
journal is destroyed, eg:

(umount)
ext4_kill_sb
     kill_block_super
       generic_shutdown_super
         sync_filesystem /* commits all txns */
         evict_inodes
           /* might start a new txn */
         ext4_put_super
           flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
           jbd2_journal_destroy
             journal_kill_thread
               journal->j_flags |= JBD2_UNMOUNT;
             jbd2_journal_commit_transaction
               jbd2_journal_get_descriptor_buffer
                 jbd2_journal_bmap
                   ext4_journal_bmap
                     ext4_map_blocks
                       ...
                       ext4_inode_error
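
The diff itself isn't quoted above, but going by the subject line the idea
is to only defer the superblock update to the workqueue while the superblock
is still SB_ACTIVE, and to write it out directly otherwise. A rough sketch of
that idea, assuming the deferral point is the schedule_work() call in
ext4_handle_error() as in current mainline (this is only an illustration,
not the actual patch):

	/* Sketch only -- not the actual patch. */
	if (continue_fs && journal && (sb->s_flags & SB_ACTIVE))
		/* fs is still live, deferring the sb update is safe */
		schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
	else
		/*
		 * Tear-down in progress (generic_shutdown_super() has
		 * already cleared SB_ACTIVE): update the superblock
		 * directly rather than queueing work that may run after
		 * jbd2_journal_destroy().
		 */
		ext4_commit_super(sb);

With something like that, the error hit inside jbd2_journal_destroy() in the
trace above would no longer queue s_sb_upd_work after the final flush_work().
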
Just curious, since jbd2_journal_bmap() only queries the map and does not
create it, how does it fail here? Is there more information in dmesg?
Is s_journal_inum normal after file system corruption?
Hey Baokun,
Hello Ojaswin,

Thanks for your detailed explanation!
So I dug a bit more into the vmcore. The error information in sbi looks
like this:

    s_add_error_count = 1,
    s_first_error_code = 117,
    s_first_error_line = 475,
    s_first_error_ino = 0,
    s_first_error_block = 0,
    s_first_error_func = 0xc0080000055300d0 <__func__.6> "ext4_read_block_bitmap_nowait",
    s_first_error_time = 1737023235,

    s_last_error_code = 117,
    s_last_error_line = 609,
    s_last_error_ino = 8,
    s_last_error_block = 783,
    s_last_error_func = 0xc008000005531b10 <__func__.41> "ext4_map_blocks",
    s_last_error_time = 1737023236,

The first error is here:

       473           if ((bitmap_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
       474               (bitmap_blk >= ext4_blocks_count(sbi->s_es))) {
    *  475                   ext4_error(sb, "Invalid block bitmap block %llu in "
       476                              "block_group %u", bitmap_blk, block_group);
       477                   ext4_mark_group_bitmap_corrupted(sb, block_group,
       478                                           EXT4_GROUP_INFO_BBITMAP_CORRUPT);
       479                   return ERR_PTR(-EFSCORRUPTED);
       480           }

and the last error is here:

      608           if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
   *  609                   ret = check_block_validity(inode, map);
      610                   if (ret != 0)
      611                           return ret;
      612           }


And indeed we have the traces of the first error in dmesg:

[75284.713463] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
[75284.713470] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0
[75284.713476] EXT4-fs error (device loop36): ext4_read_block_bitmap_nowait:475: comm proc01: Invalid block bitmap block 0 in block_group 0

However, the last error seems strange. It seems like check_block_validity
should ideally never fail for the journal inode. Unfortunately, the sbi->s_es
page is not recorded in the crash dump for some reason, so I don't know its
exact value at the time of the check, but looking at journal->j_inode->i_ino,
the inode number is 8, which seems fine to me. So yeah, I'm a bit unsure what
caused the corruption. I'll look a bit more into the proc01 LTP test to see
if we can recreate the failure and get more info.
Right, check_block_validity() skips the validity check when the inode is
the journal inode. So if the check fails for the journal inode, that shows
s_es->s_journal_inum and journal->j_inode->i_ino are different. The file
system never modifies s_journal_inum at runtime, so it must have been
modified by some other writer bypassing the file system (i.e. writing to
the bare disk).
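
For reference, the exemption at the top of __check_block_validity() looks
roughly like this (paraphrased from memory, so please double check against
your tree):

	if (ext4_has_feature_journal(inode->i_sb) &&
	    (inode->i_ino ==
	     le32_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_journal_inum)))
		return 0;	/* never range-check the journal inode */

So a mismatch between the on-disk s_journal_inum and the real journal inode
number is the only way the journal inode's mapping can reach the validity
check at all.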

If that's the case, we could avoid this issue by comparing against
EXT4_JOURNAL_INO directly, or by saving s_journal_inum in ext4_sb_info at
mount time (which offers better compatibility).
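
Something like the following, completely untested and only to illustrate
the EXT4_JOURNAL_INO variant (the ext4_sb_info variant would compare
against a copy of s_journal_inum saved at mount time instead):

	/*
	 * Illustrative sketch only: compare against the well-known journal
	 * inode number instead of re-reading s_es->s_journal_inum, which a
	 * rogue writer may have clobbered on disk.
	 */
	if (ext4_has_feature_journal(inode->i_sb) &&
	    inode->i_ino == EXT4_JOURNAL_INO)
		return 0;
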
Hey Baokun,

So I'm wondering if modifying the check in __check_block_validity() is the
right thing to do. The fact that something has changed the on-disk journal
inode number is itself concerning, and maybe stopping IO here is the right
thing to do instead of using a cached journal inode number and silently
ignoring the issue.
Because the cached journal inode is fine (it was validated at mount time),
and the data we're committing is good too, I think it's okay to keep
committing. The actual problem is that someone modified the superblock
directly, bypassing the file system, and the file system can't protect
against that.

Anyway, now that the journal inode is corrupted on disk, will the
recovery tooling even be able to read the journal anymore?
Actually, the journal inode on disk could still be okay. This just tells us
that s_es->s_journal_inum is abnormal, meaning part of the on-disk superblock
got changed. If only the superblock was modified, we can use a backup
superblock or grab the latest superblock copy from the journal.
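
In case it helps the investigation: dumpe2fs can show what the on-disk
primary superblock currently claims, and e2fsck can be pointed at a backup
superblock if the primary really was clobbered. For example (the device name
is taken from the dmesg above, and 32768 is just the usual first backup
location for 4k-block file systems; the real offsets can be listed with
"mke2fs -n" on a scratch device of the same geometry):

	# what does the primary superblock say the journal inode is?
	dumpe2fs -h /dev/loop36 | grep -i journal

	# repair using a backup superblock
	e2fsck -b 32768 /dev/loop36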


Cheers,
Baokun



