On Fri, Aug 23, 2019 at 08:55:54PM -0700, Darrick J. Wong wrote: > I'm probably misunderstanding the ext4 extent cache horribly, but I keep > wondering why any of this is necessary -- why can't ext4 track the > unwritten status in the extent records directly? And why is there all > this strange "can merge" logic? If you need to convert blocks X to Y > to written state because a write to those blocks completed, isn't that > just manipulation of a bunch of incore records? And can't you just seek > back and forth in the extent cache to look for adjacent records to merge > with? <confuseD> Same here. I'm not an ext4 expert, but here is what we do in XFS, which hopefully works in some form for ext4 a well: - when starting a direct I/O we allocate any needed blocks and do so as unwritten extent. The extent tree code will merge them in whatever way that seems suitable - if the IOMAP_DIO_UNWRITTEN is set on the iomap at ->end_io time we call a function that walks the whole range covered by the ioend, and convert any unwritten extent to a normal written extent. Any splitting and merging will be done as needed by the low-level extent tree code - this also means we don't need the xfs_ioen structure (which ext4) copied from for direct I/O at all (we used to have it initially, though including the time when ext4 copied this code). - we don't need the equivalent to the ext4_unwritten_wait call in ext4_file_write_iter because we serialize any non-aligned I/O instead of trying to optimize for weird corner cases > (I'd really prefer not to go adding private fields all over the > place...) Agreed.