On Thu, May 05, 2022 at 02:58:21PM +1000, Dave Chinner wrote:
> On Tue, May 03, 2022 at 07:39:58AM +0100, Matthew Wilcox (Oracle) wrote:
> > This is very much in development and basically untested, but Damien
> > started describing to me something that he wanted, and I told him he
> > was asking for the wrong thing, and I already had this patch series
> > in progress.  If someone wants to pick it up and make it mergeable,
> > that'd be grand.
>
> That's very non-descriptive.  Saying "someone wanted something, I said
> it's wrong, so here's a patch series about something else" doesn't tell
> me anything about the problem that Damien was trying to solve.

Sorry about that.  I was a bit jet-lagged when I wrote it.

> > The idea is that an O_SYNC write is always going to want to write, and
> > we know that at the time we're storing into the page cache.  So for an
> > otherwise clean folio, we can skip the part where we dirty the folio,
> > find the dirty folios and wait for their writeback.
>
> What exactly is this shortcut trying to optimise away?  A bit of CPU
> time?
>
> O_SYNC is already a write-through operation - we just call
> filemap_write_and_wait_range() once we've copied the data into the
> page cache and dirtied the page.  What does skipping the dirty page
> step gain us?

Two things: the original reason I was doing this, and Damien's reason.

My reason: a small write to a large folio will cause the entire folio
to be dirtied and written.  This is unnecessary with O_SYNC; since
we're about to force the write anyway, we may as well write only the
modified part of the folio and skip the whole dirtying step.

Damien's reason: it's racy.  Somebody else (... even vmscan) could
cause folios to be written out of order.  This matters for ZoneFS
because writing a file out of order is Not Allowed.  He was looking at
relaxing O_DIRECT, but I think what he really wants is a writethrough
page cache.
> > We can just mark the folio as writeback-in-progress and start the
> > I/O there and then (where we know exactly which blocks need to be
> > written, so possibly a smaller I/O than writing the entire page).
> > The existing "find dirty pages, start I/O and wait on them" code
> > will end up waiting on this pre-started I/O to complete, even though
> > it didn't start any of its own I/O.
> >
> > The important part is patch 9.  Everything before it is boring prep
> > work.  I'm in two minds about whether to keep the 'write_through'
> > bool, or remove it.  So feel free to read patches 9+10 squashed
> > together, or as if patch 10 doesn't exist.  Whichever feels better.
> >
> > The biggest problem with all this is that iomap doesn't have the
> > necessary information to cause extent allocation, so if you do an
> > O_SYNC write to an extent which is a HOLE or DELALLOC, we can't do
> > this optimisation.  Maybe that doesn't really matter for interesting
> > applications.  I suspect it doesn't matter for ZoneFS.
>
> This seems like a lot of complexity for only partial support.  It
> introduces races with page dirtying and cleaning, it likely has
> interesting issues with all the VM dirty/writeback accounting
> (because this series is using a completion path that expects the
> submission path has done its side of the accounting), and it only
> works if certain preconditions are met.

If we want to have better O_SYNC support, I think we can improve those
conditions.  For example, XFS could preallocate the blocks before
calling into iomap.  Since it's an O_SYNC write, everything is already
terrible.