Debugging ext4 corruption with nojournal & extents

Samuel Mendoza-Jonas <samjonas@xxxxxxxxxx> · Mon, 8 Nov 2021 09:35:20 -0800

Hi all,

Recently I've been digging into a corruption issue which I think is just about
pinned down, but I'd appreciate some more expert ext4 eyes to confirm we're on
the right path.

What we have boils down to a system with:
- An ext4 filesystem with the journal disabled
- A workload[0] which, in a loop:
  - Creates a lot of small files
  - Occasionally deletes these files and collects them into a single larger
    "compound" file
  - Periodically checks the header of all of these files to ensure they're
    correct
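
To make the shape of that loop concrete, here is a heavily simplified Python
sketch. It is not the real workload (that is Elasticsearch/Lucene driven by
esrally). The HEADER magic, the file names, and the function names are all
hypothetical. It is also not expected to reproduce the corruption on its own,
since reads will normally be served from the page cache.

```python
import os
import tempfile

HEADER = b"SEGHDR01"  # hypothetical magic; the real workload checks Lucene headers


def write_small_files(directory, count, start=0):
    """Create `count` small files, each beginning with HEADER, and fsync them."""
    paths = []
    for i in range(start, start + count):
        path = os.path.join(directory, f"small_{i}.seg")
        with open(path, "wb") as f:
            f.write(HEADER + os.urandom(256))
            f.flush()
            os.fsync(f.fileno())
        paths.append(path)
    return paths


def merge_into_compound(paths, compound_path):
    """Collect the small files into one larger file, then delete the originals."""
    with open(compound_path, "wb") as out:
        out.write(HEADER)
        for path in paths:
            with open(path, "rb") as f:
                out.write(f.read())
        out.flush()
        os.fsync(out.fileno())
    for path in paths:
        os.unlink(path)


def bad_headers(directory):
    """Return the names of files in `directory` that do not start with HEADER."""
    bad = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), "rb") as f:
            if f.read(len(HEADER)) != HEADER:
                bad.append(name)
    return bad


def run(directory, rounds=3, files_per_round=8):
    """Run the create / merge / check loop; return the first list of bad files."""
    for r in range(rounds):
        paths = write_small_files(directory, files_per_round, r * files_per_round)
        # Merge (and delete) half of this round's files, keep the rest around.
        merge_into_compound(paths[: files_per_round // 2],
                            os.path.join(directory, f"compound_{r}.cfs"))
        bad = bad_headers(directory)
        if bad:
            return bad
    return []


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print("bad files:", run(d))
```

In the failing case, the equivalent of bad_headers() is what reports a file
whose first bytes are 0a f3 rather than the expected header.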

After a while this check fails, and on inspection the contents of the "bad"
file turn out to be an ext4 extent structure, for example:

[ec2-user@ip-172-31-0-206 ~]$ hexdump -C _2w.si
00000000  0a f3 05 00 54 01 00 00  00 00 00 00 00 00 00 00  |....T...........|
00000010  01 00 00 00 63 84 08 05  01 00 00 00 ff 01 00 00  |....c...........|
00000020  75 8a 1c 02 00 02 00 00  00 02 00 00 00 9c 1c 02  |u...............|
00000030  00 04 00 00 dc 00 00 00  00 ac 1c 02 dc 04 00 00  |................|
00000040  08 81 00 00 dc ac 1c 02  00 00 00 00 00 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000170  00 00 00                                          |...|
00000173

This starts with EXT4_EXT_MAGIC (cpu_to_le16(0xf30a)) and, when parsed as an
extent header plus extent array, has 5 extent entries at depth 0.
By the time the file is checked, the file that these extents presumably
belonged to appears to have been deleted, but the physical blocks they point to
still read back as what looks like the data of one of the larger files this
test creates.
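
For anyone who wants to check that parse, here is a small Python sketch that
decodes the first 0x50 bytes of the dump above. It assumes the usual on-disk
little-endian layout of struct ext4_extent_header (12 bytes) followed by
struct ext4_extent entries (12 bytes each). The function and dictionary key
names are mine and purely illustrative.

```python
import struct

EXT4_EXT_MAGIC = 0xF30A
EXT_INIT_MAX_LEN = 1 << 15  # ee_len values above this mark an unwritten extent
HEADER_FMT = "<HHHHI"       # eh_magic, eh_entries, eh_max, eh_depth, eh_generation
EXTENT_FMT = "<IHHI"        # ee_block, ee_len, ee_start_hi, ee_start_lo

# First 0x50 bytes of the hexdump above.
RAW = bytes.fromhex(
    "0a f3 05 00 54 01 00 00 00 00 00 00 00 00 00 00 "
    "01 00 00 00 63 84 08 05 01 00 00 00 ff 01 00 00 "
    "75 8a 1c 02 00 02 00 00 00 02 00 00 00 9c 1c 02 "
    "00 04 00 00 dc 00 00 00 00 ac 1c 02 dc 04 00 00 "
    "08 81 00 00 dc ac 1c 02 00 00 00 00 00 00 00 00"
)


def parse_extent_leaf(raw):
    """Decode an extent header and its leaf extents from raw bytes."""
    magic, entries, max_entries, depth, generation = struct.unpack_from(
        HEADER_FMT, raw, 0)
    if magic != EXT4_EXT_MAGIC:
        raise ValueError(f"bad extent magic {magic:#06x}")
    if depth != 0:
        # Index nodes hold ext4_extent_idx entries, which this sketch ignores.
        raise ValueError("not a leaf node")

    header = {"entries": entries, "max": max_entries,
              "depth": depth, "generation": generation}
    extents = []
    offset = struct.calcsize(HEADER_FMT)
    for _ in range(entries):
        ee_block, ee_len, start_hi, start_lo = struct.unpack_from(
            EXTENT_FMT, raw, offset)
        unwritten = ee_len > EXT_INIT_MAX_LEN
        extents.append({
            "logical": ee_block,
            "length": ee_len - EXT_INIT_MAX_LEN if unwritten else ee_len,
            "physical": (start_hi << 32) | start_lo,
            "unwritten": unwritten,
        })
        offset += struct.calcsize(EXTENT_FMT)
    return header, extents


if __name__ == "__main__":
    header, extents = parse_extent_leaf(RAW)
    print(header)
    for e in extents:
        print(e)
```

The "physical" values are filesystem block numbers, so multiplying by the
filesystem block size gives the byte offset to read from the underlying
device.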

Based on that, what I think is happening is:
- A file with separate (i.e. non-inline) extents is synced / written to disk
  (in this case, one of the large "compound" files)
- ext4_end_io_end() kicks off writeback of extent metadata
  - AIUI this marks the related buffers dirty but does not wait on them in the
    no-journal case
- The file is deleted, causing the extents to be "removed" and the blocks where
  they were stored are marked unused
- A new file is created (any file, separate extents not required)
- The new file is allocated the block that was just freed (the physical block
  where the old extents were located)

Some time between this point and when the file is next read, the dirty extent
buffer hits the disk instead of the intended data for the new file.
A big-hammer hack in __ext4_handle_dirty_metadata() to always sync metadata
blocks appears to avoid the issue but isn't ideal - most likely a better
solution would be to ensure any dirty metadata buffers are synced before the
inode is dropped.
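
The suspected ordering can be illustrated with a toy model in Python. This is
only a sketch of the hypothesis above, not a description of the actual kernel
code paths: ToyDisk, its methods, and the block number are all invented for
illustration. The sync flag stands in for the "always sync metadata" hack.

```python
class ToyDisk:
    """Toy model: on-disk blocks plus a dirty buffer cache keyed by block number."""

    def __init__(self):
        self.disk = {}     # block number -> bytes currently on "disk"
        self.dirty = {}    # block number -> dirty buffer awaiting writeback
        self.free = set()  # blocks available for reallocation

    def dirty_metadata(self, block, data, sync=False):
        """Mark a metadata buffer dirty; optionally write it out immediately."""
        self.dirty[block] = data
        if sync:
            self.writeback(block)

    def writeback(self, block=None):
        """Flush one dirty buffer, or all of them if no block is given."""
        blocks = [block] if block is not None else list(self.dirty)
        for b in blocks:
            if b in self.dirty:
                self.disk[b] = self.dirty.pop(b)

    def free_block(self, block):
        """Free a block. Any dirty buffer for it is left in the cache,
        which is the behaviour the hypothesis assumes."""
        self.free.add(block)

    def write_data(self, block, data):
        """A new file is allocated the freed block and its data reaches disk."""
        self.free.discard(block)
        self.disk[block] = data


def scenario(sync_metadata):
    d = ToyDisk()
    d.dirty_metadata(100, b"EXTENTS", sync=sync_metadata)  # old file's extent block
    d.free_block(100)                                      # old file deleted
    d.write_data(100, b"NEWFILE")                          # block reused for new file
    d.writeback()                                          # delayed writeback runs
    return d.disk[100]


if __name__ == "__main__":
    print("async metadata:", scenario(False))
    print("sync metadata: ", scenario(True))
```

In this model the asynchronous case leaves the old extent bytes in the reused
block, while the synchronous case leaves the new file's data intact. Dropping
or flushing the dirty buffer at the point the block is freed would have the
same effect in the model without syncing every metadata write.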

Overall, does this summary sound valid, or have I wandered into the weeds
somewhere?

Cheers,
Sam Mendoza-Jonas

[0] This is an Elasticsearch/Lucene workload, running the esrally tests to hit the issue.