Re: [LSF/MM/BPF TOPIC] untorn buffered writes

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Wed, 28 Feb 2024 14:11:06 +0000

On Wed, Feb 28, 2024 at 12:12:57AM -0600, Theodore Ts'o wrote:
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and the Postgres database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enabled, and let's
> suppose they are written using the proposed RWF_ATOMIC flag.   In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 64k RWF_ATOMIC write, etc, so that the
> writeback code knows what "atomic" guarantee was made at write
> time.   This very quickly becomes a mess.

I'm not entirely sure that it does become a mess.  If our implementation
of this ensures that each write ends up in a single folio (even if the
entire folio is larger than the write), then we will have satisfied the
semantics of the flag.
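
To make the single-folio condition concrete, here is a rough userspace
sketch, not kernel code.  It assumes 4 KiB pages and that folios are
naturally aligned powers of two; the names SKETCH_PAGE_SHIFT,
write_fits_in_folio() and min_folio_order() are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u	/* assumption: 4 KiB pages */

/*
 * True if the byte range [pos, pos + len) lies entirely inside one
 * naturally aligned folio of the given order.
 */
static bool write_fits_in_folio(uint64_t pos, uint64_t len, unsigned order)
{
	assert(len > 0 && order < 32);

	uint64_t folio_bytes = (uint64_t)1 << (SKETCH_PAGE_SHIFT + order);
	uint64_t first = pos / folio_bytes;		/* folio holding the first byte */
	uint64_t last = (pos + len - 1) / folio_bytes;	/* folio holding the last byte */

	return first == last;
}

/*
 * Smallest folio order that can hold the whole write, or -1 if no
 * order up to max_order works (the "fail the I/O" case).
 */
static int min_folio_order(uint64_t pos, uint64_t len, unsigned max_order)
{
	for (unsigned order = 0; order <= max_order; order++)
		if (write_fits_in_folio(pos, len, order))
			return (int)order;
	return -1;
}
```

An aligned 16k write at offset 0 needs an order-2 folio here, while the
same 16k write at offset 4096 straddles two order-2 folios and needs
order 3, which is the "folio larger than the write" case.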

That's not to say that such an implementation would be easy.  We'd have
to be able to allocate a folio of the correct size (or fail the I/O),
and we'd have to cope with already-present smaller-than-needed folios
in the page cache, but it seems like a SMOP.

> Another interface that would be much simpler to implement for buffered
> writes would be one where the untorn write granularity is set on a
> per-file descriptor basis, using fcntl(2).  We validate whether the untorn
> write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode.  (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)

I'm not opposed to this API either.
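
For reference, the bookkeeping described above might look roughly like
the following userspace model.  Everything in it is hypothetical: the
struct and function names are invented, no such fcntl(2) command exists
today, and I'm assuming the granularity must be a power of two no
larger than a device limit (dev_max).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_inode {
	uint32_t max_untorn;	/* largest granularity requested so far, 0 = none */
	unsigned writers;	/* descriptors currently open for writing */
	uint32_t dev_max;	/* largest untorn write the storage supports (assumed known) */
};

struct sketch_file {
	struct sketch_inode *inode;
	uint32_t untorn;	/* granularity requested on this descriptor */
};

static bool is_pow2(uint32_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/* Model of opening a descriptor for writing. */
static void sketch_open_for_write(struct sketch_file *f, struct sketch_inode *ino)
{
	f->inode = ino;
	f->untorn = 0;
	ino->writers++;
}

/* Model of the fcntl(2) call: validate, then raise the per-inode maximum. */
static int sketch_set_untorn(struct sketch_file *f, uint32_t gran)
{
	if (!is_pow2(gran) || gran > f->inode->dev_max)
		return -EINVAL;

	f->untorn = gran;
	if (gran > f->inode->max_untorn)
		f->inode->max_untorn = gran;
	return 0;
}

/* Model of close: the inode value drops to zero with the last writer. */
static void sketch_release(struct sketch_file *f)
{
	if (--f->inode->writers == 0)
		f->inode->max_untorn = 0;
}
```

Note that the inode value only ever grows until the last writer goes
away, so closing the descriptor that asked for the largest granularity
does not lower it.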

> The write(2) system call will check whether the size and alignment of
> the write are valid given the requested untorn write granularity.  And
> in the writeback path, the writeback will detect if there are
> contiguous (aligned) dirty pages, and make sure they are sent to the
> storage device in multiples of the largest requested untorn write
> granularity.  This provides only the guarantees required by databases,
> and obviates the need to track which pages were dirtied by an
> RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.
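
As a rough illustration of the two checks in that paragraph (again a
userspace sketch with invented names, not kernel code): I'm assuming
"valid" means both the offset and the length are multiples of the
granularity, and that writeback trims a contiguous dirty range to the
part that can be sent as whole aligned units.

```c
#include <stdbool.h>
#include <stdint.h>

/* write(2)-time check: offset and length must be multiples of gran. */
static bool untorn_write_ok(uint64_t pos, uint64_t len, uint32_t gran)
{
	if (gran == 0)		/* no untorn granularity requested */
		return true;
	return len != 0 && pos % gran == 0 && len % gran == 0;
}

struct sketch_span {
	uint64_t start;
	uint64_t len;
};

/*
 * Writeback-time helper: given a contiguous dirty byte range
 * [start, end), return the sub-range that can be submitted as whole,
 * aligned multiples of gran.  len == 0 means no such sub-range.
 */
static struct sketch_span untorn_span(uint64_t start, uint64_t end, uint32_t gran)
{
	struct sketch_span r = { start, end > start ? end - start : 0 };

	if (gran == 0)		/* nothing to align to */
		return r;

	uint64_t s = (start + gran - 1) / gran * gran;	/* round start up */
	uint64_t e = end / gran * gran;			/* round end down */

	r.start = s;
	r.len = e > s ? e - s : 0;
	return r;
}
```

With a 16k granularity, a dirty range covering bytes 4096 to 69632
yields the aligned span starting at 16384 with length 49152; the
unaligned head and tail would go out as ordinary writeback.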

I think we'd be better off treating RWF_ATOMIC like it's a bs>PS (block
size larger than page size) device.  That takes two somewhat special
cases and makes them use the same code paths, which probably means fewer
bugs as both camps will be testing the same code.