Re: [LSF/MM/BPF TOPIC] untorn buffered writes



On Wed, Feb 28, 2024 at 12:12:57AM -0600, Theodore Ts'o wrote:
> Last year, I talked about an interest in providing databases such
> as MySQL with the ability to issue writes that would not be torn
> as they write 16k database pages[1].
> [1]
> There is a patch set being worked on by John Garry which provides
> stronger guarantees than what is actually required for this use case,
> called "atomic writes".  The proposed interface for this facility
> involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
> that the specific write be written to the storage device in an
> all-or-nothing fashion, and that the write fail if that cannot be
> guaranteed.  In this interface, if userspace sends a 128k write with
> the RWF_ATOMIC flag, and the storage device supports an
> all-or-nothing write of the given size and alignment, the kernel
> will guarantee that it is sent as a single 128k request
> --- although from the database perspective, if it is using 16k
> database pages, it only needs a guarantee that if the write is torn,
> the tear only happens on a 16k boundary.  That is, if the write is
> split into 32k and 96k requests, that would be totally fine as far
> as the database is concerned --- and so the RWF_ATOMIC interface is
> a stronger guarantee than what might be needed.
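For concreteness, the call shape of the proposed interface can be sketched
from userspace.  The sketch below uses Python's os.pwritev() (which maps to
pwritev2(2) when flags are passed); the RWF_ATOMIC value 0x40 is taken from
the proposed uapi patches and is an assumption -- it is not exported by
Python's os module, and kernels or filesystems without support reject the
flag, so the sketch falls back to a plain write in that case:

```python
import errno
import os
import tempfile

# RWF_ATOMIC is not exported by Python's os module; 0x40 is the value from
# the proposed uapi patches (an assumption -- check linux/fs.h on your kernel).
RWF_ATOMIC = 0x40

def untorn_write(fd, data, offset):
    """Try an all-or-nothing pwritev2(); fall back to a plain positional
    write when the kernel or filesystem rejects RWF_ATOMIC."""
    try:
        return os.pwritev(fd, [data], offset, RWF_ATOMIC), True
    except NotImplementedError:
        # pwritev2() itself is unavailable on this platform.
        return os.pwritev(fd, [data], offset), False
    except OSError as e:
        if e.errno in (errno.EOPNOTSUPP, errno.EINVAL):
            return os.pwritev(fd, [data], offset), False
        raise

with tempfile.NamedTemporaryFile() as f:
    page = b"\x42" * 16384              # one 16k database page
    written, was_atomic = untorn_write(f.fileno(), page, 0)
```

Note the power-of-2 size: RWF_ATOMIC only accepts suitably sized and
aligned writes, which is exactly the constraint discussed below.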
> So far, the "atomic write" patchset has only focused on Direct I/O,
> where this stronger guarantee is mostly harmless, even if it is
> unneeded for the original motivating use case.  Which might be OK,
> since perhaps there might be future use cases where some 32k writes
> need to be "atomic" while other 128k writes also want to be "atomic"
> (that is to say, persisted with all-or-nothing semantics), and the
> proposed RWF_ATOMIC interface might permit that --- even though no
> one can seem to come up with a credible use case that would require
> this.
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and the Postgres database uses buffered, not
> direct I/O writes.  Suppose the database performs a 16k write,
> followed by a 64k write, followed by a 128k write --- and these
> writes are done using a file descriptor that does not have O_DIRECT
> enabled, and let's suppose they are written using the proposed
> RWF_ATOMIC flag.

Not problematic at all, we're already intending to handle this
"software RWF_ATOMIC" situation for buffered writes in XFS via a
forced COW operation.  That is, we'll allocate new blocks for the
write, and then when the data IO is complete we'll do an atomic swap
of the new data extent over the old one. We'll probably even enable
this for direct IO on hardware that doesn't support REQ_ATOMIC so
that software can just code for RWF_ATOMIC existing for all types of
IO on XFS....

> In order to provide the (stronger than we need) RWF_ATOMIC
> guarantee, the kernel would need to store the fact that certain
> pages in the page cache were dirtied as part of a 16k RWF_ATOMIC
> write, and other pages were dirtied as part of a 32k RWF_ATOMIC
> write, etc., so that the writeback code knows what "atomic"
> guarantee was made at write time.  This very quickly becomes a
> mess.

The simplification of this is using a single high-order folio for
the RWF_ATOMIC write data, then there's just a single folio that
needs to be written back. RWF_ATOMIC already has a constraint of
only being supported for aligned power-of-2 IOs, so it matches
high-order folio cache indexing exactly. We can then run RWF_ATOMIC
IO as a write-through operation (i.e. fdatawrite_range()) and IO
completion will then swap the entire range with the new data.

Hence on return from the syscall, we have new data on disk, and the
only thing that we need to do to make it permanent is commit the
journal (e.g. via RWF_DSYNC or explicit fdatasync()). This largely
makes the software RWF_ATOMIC behave exactly the same as hardware
based direct IO RWF_ATOMIC. i.e. the atomic extent swap on data IO
completion is the data integrity pivot that provides the RWF_ATOMIC
semantics, not the REQ_ATOMIC bio flag...
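The userspace analogue of this write-anywhere-then-swap scheme is the
classic write-new-copy, fsync, rename pattern -- a sketch of the same
idea, not the XFS implementation: the new data is made durable first,
and then a single atomic operation switches it in, so a reader observes
either the old contents or the new, never a torn mix:

```python
import os

def atomic_replace(path, data):
    """Userspace analogue of the COW-and-swap described above: persist the
    new copy first, then atomically switch it in.  The rename stands in
    for the filesystem's atomic extent swap at IO completion."""
    tmp = path + ".new"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)        # new data durable before the swap (the "pivot")
    finally:
        os.close(fd)
    os.rename(tmp, path)    # the atomic swap; old data is never overwritten
```

As in the scheme above, the swap (here, the rename) is what provides the
all-or-nothing semantics, not any property of the individual writes.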

Yes, I know that ext4 has neither COW nor high order folio support,
but that just means that ext4 needs to add high-order folio support
and whatever internal code it needs to implement write-anywhere data
semantics for software RWF_ATOMIC support.

> Another interface that would be much simpler to implement for
> buffered writes would be one where the untorn write granularity is
> set on a per-file-descriptor basis, using fcntl(2).  We validate
> whether the untorn write granularity is one that can be supported
> when fcntl(2) is called, and we also store in the inode the largest
> untorn write granularity that has been requested by a file
> descriptor for that inode.  (When the last file descriptor opened
> for writing has been closed, the largest untorn write granularity
> for that inode can be set back down to zero.)
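For concreteness, the call shape being proposed here would look something
like the following.  F_SET_UNTORN_GRANULARITY is purely hypothetical --
no such fcntl(2) command exists in any kernel -- so on a real system the
call fails with EINVAL, which the sketch reports as "unsupported":

```python
import errno
import fcntl
import tempfile

# Hypothetical command number, for illustration only: no such fcntl(2)
# command exists upstream, so real kernels reject it with EINVAL.
F_SET_UNTORN_GRANULARITY = 1100

def set_untorn_granularity(fd, granularity):
    """Request a per-fd untorn write granularity.  Returns False when the
    kernel does not implement the (hypothetical) command."""
    try:
        fcntl.fcntl(fd, F_SET_UNTORN_GRANULARITY, granularity)
        return True
    except OSError as e:
        if e.errno == errno.EINVAL:
            return False
        raise

with tempfile.NamedTemporaryFile() as f:
    supported = set_untorn_granularity(f.fileno(), 16384)
```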

fcntl has already been rejected for good reasons (i.e. alignment is
a persistent inode property, not an ephemeral file property). The
way we intend to do this is via fsx.fsx_extsize hints and
FS_XFLAG_FORCEALIGN control of an on-disk inode flag. This triggers
all the alignment restrictions needed to guarantee atomic writes
from the filesystem and/or hardware.
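The fsx_extsize/forcealign path is driven through the existing
FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls on struct fsxattr.  A
best-effort sketch follows; the FS_XFLAG_FORCEALIGN value is taken from
the proposed patches and is an assumption (check your kernel headers),
and most filesystems today reject it, so failure is treated as
"unsupported":

```python
import fcntl
import os
import struct

# ioctl numbers for struct fsxattr (28 bytes) from linux/fs.h.
FS_IOC_FSGETXATTR = 0x801c581f
FS_IOC_FSSETXATTR = 0x401c5820
FS_XFLAG_EXTSIZE = 0x00000800
# Value from the proposed forcealign patches -- an assumption; verify
# against your kernel headers before relying on it.
FS_XFLAG_FORCEALIGN = 0x00020000

# fsx_xflags, fsx_extsize, fsx_nextents, fsx_projid, fsx_cowextsize, pad[8]
FSXATTR = "=5I8x"

def request_forced_alignment(path, extsize):
    """Best-effort sketch: set an extent size hint plus FORCEALIGN on the
    inode.  Returns False when the filesystem lacks support."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(struct.calcsize(FSXATTR))
        fcntl.ioctl(fd, FS_IOC_FSGETXATTR, buf)
        xflags, _, nextents, projid, cowextsize = struct.unpack(FSXATTR, buf)
        newattr = struct.pack(FSXATTR,
                              xflags | FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN,
                              extsize, nextents, projid, cowextsize)
        fcntl.ioctl(fd, FS_IOC_FSSETXATTR, newattr)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
```

Because the flag lives on the on-disk inode, the alignment constraint
persists across opens, unlike a per-fd fcntl setting.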

> I'd like to discuss at LSF/MM what the best interface would be for
> buffered, untorn writes (I am deliberately avoiding the use of the
> word "atomic" since that presumes stronger guarantees than what we
> need, and because it has led to confusion in previous discussions),
> and what might be needed to support it.

I think we're almost all the way there already, and that it is
likely this will already be scheduled for discussion at LSFMM...

Dave Chinner
