Re: Debugging ext4 corruption with nojournal & extents

"Theodore Ts'o" <tytso@xxxxxxx> · Mon, 8 Nov 2021 22:14:33 -0500

On Mon, Nov 08, 2021 at 09:35:20AM -0800, Samuel Mendoza-Jonas wrote:
> Based on that what I think is happening is
> - A file with separate (i.e. non-inline) extents is synced / written to disk
>   (in this case, one of the large "compound" files)
> - ext4_end_io_end() kicks off writeback of extent metadata
>   - AIUI this marks the related buffers dirty but does not wait on them in the
>     no-journal case
> - The file is deleted, causing the extents to be "removed" and the blocks where
>   they were stored are marked unused
> - A new file is created (any file, separate extents not required)
> - The new file is allocated the block that was just freed (the physical block
>   where the old extents were located)
> 
> Some time between this point and when the file is next read, the dirty extent
> buffer hits the disk instead of the intended data for the new file.
> A big-hammer hack in __ext4_handle_dirty_metadata() to always sync metadata
> blocks appears to avoid the issue but isn't ideal - most likely a better
> solution would be to ensure any dirty metadata buffers are synced before the
> inode is dropped.
> 
> Overall does this summary sound valid, or have I wandered into the
> weeds somewhere?

Hmm... well, I can tell you what's *supposed* to happen.  When the
extent block is freed, ext4_free_blocks() gets called with the
EXT4_FREE_BLOCKS_FORGET flag set.  ext4_free_blocks() calls
ext4_forget() in two places; one when bh passed to ext4_free_blocks()
is NULL, and one where it is non-NULL.  In the no-journal case,
ext4_forget() then calls bforget(), which should cause the dirty
extent block to get thrown away.

This *should* have prevented your failure scenario from taking place,
since after the call to bforget() the dirty extent buffer *shouldn't*
have hit the disk.  If your theory is correct, then somehow either (a)
the bforget() wasn't called, or (b) the bforget() didn't work, and
then the page writeback for the new page happened first, and then
buffer cache writeback happened second, overwriting the intended data
for the new file.
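
To make the ordering concrete, here is a toy model of that sequence in
Python.  It is only a sketch of the theory, not kernel code: ToyDisk,
ToyBufferCache and run_scenario() are made-up names, the block number
is arbitrary, and forget() merely stands in for what bforget() is
supposed to achieve (dropping a dirty buffer without writing it).

```python
# Toy model of the suspected race. Nothing here is a kernel interface;
# every class and function name is hypothetical.

class ToyDisk:
    """Maps physical block number -> last contents written."""
    def __init__(self):
        self.blocks = {}

    def write(self, blocknr, data):
        self.blocks[blocknr] = data


class ToyBufferCache:
    """Holds dirty metadata buffers until writeback or forget."""
    def __init__(self, disk):
        self.disk = disk
        self.dirty = {}

    def mark_dirty(self, blocknr, data):
        self.dirty[blocknr] = data

    def forget(self, blocknr):
        # Stand-in for the intended effect of bforget(): discard the
        # dirty buffer so it is never written out.
        self.dirty.pop(blocknr, None)

    def writeback(self):
        # Flush whatever is still dirty, then clear the dirty set.
        for blocknr, data in sorted(self.dirty.items()):
            self.disk.write(blocknr, data)
        self.dirty.clear()


def run_scenario(forget_on_free):
    """Return what ends up on disk in the reused block."""
    disk = ToyDisk()
    cache = ToyBufferCache(disk)
    block = 1234  # arbitrary physical block number

    # 1. Extent block is dirtied but not yet written.
    cache.mark_dirty(block, "old extent metadata")
    # 2. File is deleted and the extent block is freed.
    if forget_on_free:
        cache.forget(block)
    # 3. New file reuses the block; page writeback happens first.
    disk.write(block, "new file data")
    # 4. Buffer cache writeback happens second.
    cache.writeback()
    return disk.blocks[block]


if __name__ == "__main__":
    print("forget works: ", run_scenario(forget_on_free=True))
    print("forget skipped:", run_scenario(forget_on_free=False))
```

With the forget step in place the block keeps the new file's data;
without it the stale extent buffer is written last and wins, which is
the corruption pattern described above.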

Have you tried enabling the blktrace tracer in combination with some
of the ext4 tracepoints, such as ext4_forget and ext4_free_blocks, to
see if you can catch the double write happening?  Unfortunately we
don't have any tracepoints in fs/ext4/page-io.c which would report the
physical block ranges coming from the writeback path.  And the
tracepoints in fs/fs-writeback.c won't have the physical block number
(just the inode and logical block numbers).
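
For what it's worth, a minimal sketch of turning on those two ext4
tracepoints from Python might look like the following.  It assumes
tracefs is mounted at /sys/kernel/tracing (older setups may have it
under /sys/kernel/debug/tracing) and that it is run as root; the
helper names event_enable_path() and set_events() are made up for
illustration.  blktrace would still be run separately against the
device.

```python
import os

# Assumption: tracefs mount point. Adjust if your system mounts it
# elsewhere (for example /sys/kernel/debug/tracing).
TRACEFS = "/sys/kernel/tracing"

# The two ext4 tracepoints mentioned above.
EXT4_EVENTS = ("ext4_forget", "ext4_free_blocks")


def event_enable_path(tracefs, event, subsystem="ext4"):
    """Build the path of a tracepoint's 'enable' control file."""
    return os.path.join(tracefs, "events", subsystem, event, "enable")


def set_events(tracefs, events, on=True):
    """Write 1 (or 0) to each event's enable file; return the paths."""
    written = []
    for event in events:
        path = event_enable_path(tracefs, event)
        with open(path, "w") as f:
            f.write("1" if on else "0")
        written.append(path)
    return written


# Typical use, as root, before reproducing the problem:
#   set_events(TRACEFS, EXT4_EVENTS)
# and afterwards:
#   set_events(TRACEFS, EXT4_EVENTS, on=False)
```

The resulting events can then be read from the trace or trace_pipe
file under the same tracefs directory and lined up against the
blktrace output by physical block number.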

                                        - Ted