On 2022/05/03 15:39, Matthew Wilcox (Oracle) wrote:
> This is very much in development and basically untested, but Damian

s/Damian/Damien :)

Thank you for posting this. I am definitely going to play with this with zonefs.

The goal is to allow replacing the mandatory O_DIRECT writing of sequential zone
files with sector aligned O_SYNC writes which "preload" the page cache for
subsequent buffered reads, thus reducing device accesses. That will also avoid
an annoying overhead with zonefs which is that applications need 2 file
descriptors per zone file: one without O_DIRECT for buffered reads and another
O_DIRECT one for writes.

In the case of zonefs, since all sequential files are always fully mapped,
allocated, cannot be used for mmap writing *and* a write is never an overwrite,
these conditions:

+	if (folio_test_dirty(folio))
+		return true;
+	/* Can't allocate blocks here because we don't have ->prepare_ioend */
+	if (iomap->type != IOMAP_MAPPED || iomap->type != IOMAP_UNWRITTEN ||
+	    iomap->flags & IOMAP_F_SHARED)
+		return false;

never trigger and the writethrough is always started with
folio_start_writeback(), essentially becoming a "direct" write from the issuer
context (under the inode lock) on the entire folio. And that should guarantee
that writes stay sequential as they must.

> started describing to me something that he wanted, and I told him he
> was asking for the wrong thing, and I already had this patch series
> in progress. If someone wants to pick it up and make it mergable,
> that'd be grand.
>
> The idea is that an O_SYNC write is always going to want to write, and
> we know that at the time we're storing into the page cache. So for an
> otherwise clean folio, we can skip the part where we dirty the folio,
> find the dirty folios and wait for their writeback. We can just mark the
> folio as writeback-in-progress and start the IO there and then (where we
> know exactly which blocks need to be written, so possibly a smaller I/O
> than writing the entire page). The existing "find dirty pages, start
> I/O and wait on them" code will end up waiting on this pre-started I/O
> to complete, even though it didn't start any of its own I/O.
>
> The important part is patch 9. Everything before it is boring prep work.
> I'm in two minds about whether to keep the 'write_through' bool, or
> remove it. So feel to read patches 9+10 squashed together, or as if
> patch 10 doesn't exist. Whichever feels better.
>
> The biggest problem with all this is that iomap doesn't have the necessary
> information to cause extent allocation, so if you do an O_SYNC write
> to an extent which is HOLE or DELALLOC, we can't do this optimisation.
> Maybe that doesn't really matter for interesting applications. I suspect
> it doesn't matter for ZoneFS.
>
> Matthew Wilcox (Oracle) (10):
>   iomap: Pass struct iomap to iomap_alloc_ioend()
>   iomap: Remove iomap_writepage_ctx from iomap_can_add_to_ioend()
>   iomap: Do not pass iomap_writepage_ctx to iomap_add_to_ioend()
>   iomap: Accept a NULL iomap_writepage_ctx in iomap_submit_ioend()
>   iomap: Allow a NULL writeback_control argument to iomap_alloc_ioend()
>   iomap: Pass a length to iomap_add_to_ioend()
>   iomap: Reorder functions
>   iomap: Reorder functions
>   iomap: Add writethrough for O_SYNC
>   remove write_through bool
>
>  fs/iomap/buffered-io.c | 492 +++++++++++++++++++++++------------------
>  1 file changed, 273 insertions(+), 219 deletions(-)
>

--
Damien Le Moal
Western Digital Research
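
For illustration only, the three conditions quoted from patch 9 can be modelled
in user space roughly as below. This is a sketch, not the kernel code: every
sketch_* name is a hypothetical stand-in, and the enum values are arbitrary.
One assumption is worth calling out: the two type tests are joined with &&
here, on the guess that the intent is "neither mapped nor unwritten". As
quoted, with ||, that test is true for every extent type. The outcomes are
named after what the quoted hunk does (return true, return false, or fall
through to folio_start_writeback()) rather than what the caller makes of them.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's iomap extent types. */
enum sketch_iomap_type {
	SKETCH_IOMAP_HOLE,
	SKETCH_IOMAP_DELALLOC,
	SKETCH_IOMAP_MAPPED,
	SKETCH_IOMAP_UNWRITTEN,
};

/* Hypothetical stand-in for IOMAP_F_SHARED; the bit value is arbitrary. */
#define SKETCH_IOMAP_F_SHARED 0x1u

struct sketch_iomap {
	enum sketch_iomap_type type;
	unsigned int flags;
};

/* The three ways out of the quoted hunk. */
enum sketch_outcome {
	SKETCH_EARLY_RETURN_TRUE,	/* folio was already dirty */
	SKETCH_EARLY_RETURN_FALSE,	/* extent needs allocation, or is shared */
	SKETCH_START_WRITEBACK,		/* falls through to folio_start_writeback() */
};

static enum sketch_outcome
sketch_writethrough_check(bool folio_dirty, const struct sketch_iomap *iomap)
{
	if (folio_dirty)
		return SKETCH_EARLY_RETURN_TRUE;

	/* Assumption: && between the type tests (the quoted diff has ||). */
	if ((iomap->type != SKETCH_IOMAP_MAPPED &&
	     iomap->type != SKETCH_IOMAP_UNWRITTEN) ||
	    (iomap->flags & SKETCH_IOMAP_F_SHARED))
		return SKETCH_EARLY_RETURN_FALSE;

	return SKETCH_START_WRITEBACK;
}
```

Under that reading, a clean folio over a mapped, unshared extent (the zonefs
sequential file case described above) always reaches the writeback start,
while a HOLE or DELALLOC extent takes the early "return false" exit.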