Re: [RFC PATCH 00/10] Make O_SYNC writethrough

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Tue, 3 May 2022 21:57:10 +0900

On 2022/05/03 15:39, Matthew Wilcox (Oracle) wrote:
> This is very much in development and basically untested, but Damian

s/Damian/Damien :)

Thank you for posting this. I am definitely going to play with this with zonefs.

The goal is to replace the mandatory O_DIRECT writes to sequential zone files
with sector-aligned O_SYNC writes that "preload" the page cache for subsequent
buffered reads, thus reducing device accesses. That would also remove an
annoying overhead with zonefs: applications currently need 2 file descriptors
per zone file, one without O_DIRECT for buffered reads and another with
O_DIRECT for writes.
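
To make the single-descriptor pattern concrete, here is a minimal userspace
sketch. It assumes a regular file (not a zonefs zone file) and a 4096-byte
write granularity; osync_write_and_readback() and DEMO_BLOCK_SIZE are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed write granularity for the demo; not a zonefs constant. */
#define DEMO_BLOCK_SIZE 4096
static_assert(DEMO_BLOCK_SIZE % 512 == 0,
	      "block size must be a multiple of 512-byte sectors");

/*
 * Hypothetical helper: write len bytes at offset off through a single
 * O_SYNC descriptor, then read them back through that same descriptor.
 * Returns 0 when the data read back matches, -1 on error or mismatch.
 */
static int osync_write_and_readback(const char *path, off_t off,
				    const void *data, size_t len)
{
	char *rbuf;
	ssize_t n;
	int fd, ret = -1;

	/* Keep the alignment that the O_DIRECT write path already requires. */
	if (len == 0 || len % DEMO_BLOCK_SIZE || off % DEMO_BLOCK_SIZE) {
		errno = EINVAL;
		return -1;
	}

	/* One descriptor for both directions: O_SYNC, but no O_DIRECT. */
	fd = open(path, O_RDWR | O_SYNC);
	if (fd < 0)
		return -1;

	rbuf = malloc(len);
	if (!rbuf)
		goto out_close;

	/* With O_SYNC, pwrite() returns once the data is reported written. */
	n = pwrite(fd, data, len, off);
	if (n != (ssize_t)len)
		goto out_free;

	/* Buffered read on the same fd; it may be served from the page cache. */
	n = pread(fd, rbuf, len, off);
	if (n == (ssize_t)len && memcmp(rbuf, data, len) == 0)
		ret = 0;

out_free:
	free(rbuf);
out_close:
	close(fd);
	return ret;
}
```

A caller would pass the path of an existing file, an aligned offset and an
aligned buffer length, e.g. osync_write_and_readback(path, 0, buf, 4096).
Whether the read is actually a page cache hit depends on the kernel and is not
checked here.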

In the case of zonefs, since all sequential files are always fully mapped and
allocated, cannot be used for mmap writing, *and* a write is never an overwrite,
these conditions:

+	if (folio_test_dirty(folio))
+		return true;

+	/* Can't allocate blocks here because we don't have ->prepare_ioend */
+	if ((iomap->type != IOMAP_MAPPED && iomap->type != IOMAP_UNWRITTEN) ||
+	    (iomap->flags & IOMAP_F_SHARED))
+		return false;

never trigger and the writethrough is always started with
folio_start_writeback(), essentially becoming a "direct" write from the issuer
context (under the inode lock) on the entire folio. And that should guarantee
that writes stay sequential as they must.
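
As a rough illustration of that reasoning, the eligibility test can be modelled
in userspace. This is only a sketch: the demo_* types, the flag value and
demo_can_write_through() are hypothetical stand-ins, not the kernel definitions
or the function from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins; the names mirror, but are not, the kernel ones. */
enum demo_iomap_type {
	DEMO_HOLE,
	DEMO_DELALLOC,
	DEMO_MAPPED,
	DEMO_UNWRITTEN,
	DEMO_INLINE,
};

#define DEMO_F_SHARED 0x4u	/* arbitrary bit chosen for the demo */

struct demo_extent {
	enum demo_iomap_type type;
	unsigned int flags;
};

/*
 * Hypothetical predicate: true when an O_SYNC write to a clean folio can
 * be sent straight to writeback instead of going through the dirty path.
 */
static bool demo_can_write_through(bool folio_dirty,
				   const struct demo_extent *ext)
{
	/* An already dirty folio takes the ordinary dirty + writeback path. */
	if (folio_dirty)
		return false;

	/* Blocks must already exist: HOLE and DELALLOC need allocation. */
	if (ext->type != DEMO_MAPPED && ext->type != DEMO_UNWRITTEN)
		return false;

	/* A shared extent has to be unshared first, so it is excluded too. */
	if (ext->flags & DEMO_F_SHARED)
		return false;

	return true;
}

/* zonefs-like case: clean folio over a mapped, unshared extent. */
static void demo_self_check(void)
{
	struct demo_extent seq = { DEMO_MAPPED, 0 };
	struct demo_extent hole = { DEMO_HOLE, 0 };

	assert(demo_can_write_through(false, &seq));
	assert(!demo_can_write_through(false, &hole));
}
```

Note that the type test needs `&&`: a type can never equal both values at once,
so joining the two `!=` tests with `||` would reject every extent.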

> started describing to me something that he wanted, and I told him he
> was asking for the wrong thing, and I already had this patch series
> in progress.  If someone wants to pick it up and make it mergeable,
> that'd be grand.
> 
> The idea is that an O_SYNC write is always going to want to write, and
> we know that at the time we're storing into the page cache.  So for an
> otherwise clean folio, we can skip the part where we dirty the folio,
> find the dirty folios and wait for their writeback.  We can just mark the
> folio as writeback-in-progress and start the IO there and then (where we
> know exactly which blocks need to be written, so possibly a smaller I/O
> than writing the entire page).  The existing "find dirty pages, start
> I/O and wait on them" code will end up waiting on this pre-started I/O
> to complete, even though it didn't start any of its own I/O.
> 
> The important part is patch 9.  Everything before it is boring prep work.
> I'm in two minds about whether to keep the 'write_through' bool, or
> remove it.  So feel free to read patches 9+10 squashed together, or as if
> patch 10 doesn't exist.  Whichever feels better.
> 
> The biggest problem with all this is that iomap doesn't have the necessary
> information to cause extent allocation, so if you do an O_SYNC write
> to an extent which is HOLE or DELALLOC, we can't do this optimisation.
> Maybe that doesn't really matter for interesting applications.  I suspect
> it doesn't matter for ZoneFS.
> 
> Matthew Wilcox (Oracle) (10):
>   iomap: Pass struct iomap to iomap_alloc_ioend()
>   iomap: Remove iomap_writepage_ctx from iomap_can_add_to_ioend()
>   iomap: Do not pass iomap_writepage_ctx to iomap_add_to_ioend()
>   iomap: Accept a NULL iomap_writepage_ctx in iomap_submit_ioend()
>   iomap: Allow a NULL writeback_control argument to iomap_alloc_ioend()
>   iomap: Pass a length to iomap_add_to_ioend()
>   iomap: Reorder functions
>   iomap: Reorder functions
>   iomap: Add writethrough for O_SYNC
>   remove write_through bool
> 
>  fs/iomap/buffered-io.c | 492 +++++++++++++++++++++++------------------
>  1 file changed, 273 insertions(+), 219 deletions(-)
> 

-- 
Damien Le Moal
Western Digital Research