On Mon, Nov 08, 2021 at 09:35:20AM -0800, Samuel Mendoza-Jonas wrote:
> Based on that what I think is happening is
> - A file with separate (i.e. non-inline) extents is synced / written to disk
>   (in this case, one of the large "compound" files)
> - ext4_end_io_end() kicks off writeback of extent metadata
> - AIUI this marks the related buffers dirty but does not wait on them in the
>   no-journal case
> - The file is deleted, causing the extents to be "removed" and the blocks where
>   they were stored are marked unused
> - A new file is created (any file, separate extents not required)
> - The new file is allocated the block that was just freed (the physical block
>   where the old extents were located)
>
> Some time between this point and when the file is next read, the dirty extent
> buffer hits the disk instead of the intended data for the new file.
> A big-hammer hack in __ext4_handle_dirty_metadata() to always sync metadata
> blocks appears to avoid the issue but isn't ideal - most likely a better
> solution would be to ensure any dirty metadata buffers are synced before the
> inode is dropped.
>
> Overall does this summary sound valid, or have I wandered into the
> weeds somewhere?

Hmm... well, I can tell you what's *supposed* to happen. When the
extent block is freed, ext4_free_blocks() gets called with the
EXT4_FREE_BLOCKS_FORGET flag set. ext4_free_blocks() calls
ext4_forget() in two places; one when bh passed to ext4_free_blocks()
is NULL, and one where it is non-NULL. And then ext4_free_blocks()
calls bforget(), which should cause the dirty extent block to get
thrown away.

This *should* have prevented your failure scenario from taking place,
since after the call to bforget() the dirty extent buffer *shouldn't*
have hit the disk.

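To make the ordering concrete, here is a toy Python model of the
sequence described above. It is only a sketch and not kernel code:
ToyBufferCache, run_scenario() and the block number are all made up
for illustration, and the forget() method merely stands in for the
role bforget() is expected to play (dropping a dirty buffer without
writing it).

```python
# Toy model of the write-ordering problem. NOT kernel code: every name
# here is hypothetical and only mimics the roles of the buffer cache,
# bforget(), and page writeback.

class ToyBufferCache:
    """Holds dirty metadata buffers keyed by physical block number."""

    def __init__(self):
        self.dirty = {}

    def mark_dirty(self, block, contents):
        # Metadata is dirtied but not written (no waiting, as in the
        # no-journal case described in the quoted text).
        self.dirty[block] = contents

    def forget(self, block):
        # Stand-in for bforget(): discard the dirty buffer unwritten.
        self.dirty.pop(block, None)

    def writeback(self, disk):
        # Flush whatever is still dirty to the "disk".
        for block, contents in self.dirty.items():
            disk[block] = contents
        self.dirty.clear()


def run_scenario(forget_works):
    """Return what ends up on disk in the reused block."""
    disk = {}
    cache = ToyBufferCache()
    block = 1000  # arbitrary physical block number

    # 1. The old file's extent block is dirtied but not yet on disk.
    cache.mark_dirty(block, "old extent metadata")

    # 2. The file is deleted and the extent block is freed. If the
    #    forget step is skipped or ineffective, the buffer stays dirty.
    if forget_works:
        cache.forget(block)

    # 3. A new file is allocated the same block; page writeback for
    #    its data happens first.
    disk[block] = "new file data"

    # 4. Buffer cache writeback happens second.
    cache.writeback(disk)
    return disk[block]


if __name__ == "__main__":
    print("forget works: ", run_scenario(True))
    print("forget broken:", run_scenario(False))
```

With forget_works=True the block keeps the new file's data; with
forget_works=False the stale metadata lands on top of it, which is the
corruption pattern being hypothesized.
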
If your theory is correct, then somehow either (a) the bforget()
wasn't called, or (b) the bforget() didn't work, and then the page
writeback for the new page happened first, and then buffer cache
writeback happened second, overwriting the intended data for the new
file.

Have you tried enabling the blktrace tracer in combination with some
of the ext4 tracepoints, to see if you can catch the double write
happening? Another thing to try would be enabling some tracepoints,
such as ext4_forget and ext4_free_blocks.

Unfortunately we don't have any tracepoints in fs/ext4/page-io.c that
would report the physical block ranges coming from the writeback path.
And the tracepoints in fs/fs-writeback.c won't have the physical block
number (just the inode and logical block numbers).

					- Ted