Re: [LSF/MM/BPF TOPIC] untorn buffered writes

John Garry <john.g.garry@xxxxxxxxxx> · Wed, 15 May 2024 13:54:39 -0600

On 27/02/2024 23:12, Theodore Ts'o wrote:
Last year, I talked about an interest to provide database such as
MySQL with the ability to issue writes that would not be torn as they
write 16k database pages[1].

[1] https://urldefense.com/v3/__https://lwn.net/Articles/932900/__;!!ACWV5N9M2RV99hQ!Ij_ZeSZrJ4uPL94Im73udLMjqpkcZwHmuNnznogL68ehu6TDTXqbMsC4xLUqh18hq2Ib77p1D8_4mV5Q$

After discussing this topic earlier this week, I would like to know if 
there are still objections or concerns with the untorn-writes userspace 
API proposed in 
https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@xxxxxxxxxx/

I feel that the series for supporting direct-IO only, above, is stuck 
because of this topic of buffered IO.

So I sent an RFC for buffered untorn-writes last month in 
https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@xxxxxxxxxx/, 
which did leverage the bs > ps effort. Maybe it did not get noticed due 
to being an RFC. It works on the following principles:

- A buffered atomic write requires RWF_ATOMIC flag be set, same as
  direct IO. The same other atomic writes rules apply.
- For an inode, only a single size of buffered write is allowed. So for
  statx, atomic_write_unit_min = atomic_write_unit_max always for
  buffered atomic writes.
- A single folio maps to an atomic write in the pagecache. So inode
  address_space folio min order = max order = atomic_write_unit_min/max
- A folio is tagged as "atomic" when atomically written and written back
  to storage "atomically", same as direct-IO method would do for an
  atomic write.
- If userspace wants to guarantee a buffered atomic write is written to
  storage atomically after the write syscall returns, it must use
  RWF_SYNC or similar (along with RWF_ATOMIC).

This is all along the lines of what I described on Monday.

There are no concrete semantics for buffered untorn-writes ATM - like 
mixing RWF_ATOMIC write with non-RWF_ATOMIC writes in the pagecache - 
but I don't think that this needs to be formalized yet. Or, if it really 
does, let me know.

There was also talk in the "limits of buffered IO.. " session - as I 
understand - that RWF_ATOMIC for buffered IO should be writethough. If 
anyone wants to discuss that further or describe that issue, then please do.

Anyway, I plan to push the direct IO series for merging in the next 
cycle, so let me know of what else to discuss and get conclusion on.

There is a patch set being worked on by John Garry which provides
stronger guarantees than what is actually required for this use case,
called "atomic writes".  The proposed interface for this facility
involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
that the specific write be written to the storage device in an
all-or-nothing fashion, and if it can not be guaranteed, that the
write should fail.  In this interface, if the userspace sends an 128k
write with the RWF_ATOMIC flag, if the storage device will support
that an all-or-nothing write with the given size and alignment the
kernel will guarantee that it will be sent as a single 128k request
--- although from the database perspective, if it is using 16k
database pages, it only needs to guarantee that if the write is torn,
it only happen on a 16k boundary.  That is, if the write is split into
32k and 96k request, that would be totally fine as far as the database
is concerned --- and so the RWF_ATOMIC interface is a stronger
guarantee than what might be needed.

So far, the "atomic write" patchset has only focused on Direct I/O,
where this stronger guarantee is mostly harmless, even if it is
unneeded for the original motivating use case.  Which might be OK,
since perhaps there might be other future use cases where they might
want some 32k writes to be "atomic", while other 128k writes might
want to be "atomic" (that is to say, persisted with all-or-nothing
semantics), and the proposed RWF_ATOMIC interface might permit that
--- even though no one can seem top come up with a credible use case
that would require this.

However, this proposed interface is highly problematic when it comes
to buffered writes, and Postgress database uses buffered, not direct
I/O writes.   Suppose the database performs a 16k write, followed by a
64k write, followed by a 128k write --- and these writes are done
using a file descriptor that does not have O_DIRECT enable, and let's
suppose they are written using the proposed RWF_ATOMIC flag.   In
order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
kernel would need to store the fact that certain pages in the page
cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
writeback code knows what the "atomic" guarantee that was made at
write time.   This very quickly becomes a mess.

Another interface that one be much simpler to implement for buffered
writes would be one the untorn write granularity is set on a per-file
descriptor basis, using fcntl(2).  We validate whether the untorn
write granularity is one that can be supported when fcntl(2) is
called, and we also store in the inode the largest untorn write
granularity that has been requested by a file descriptor for that
inode.  (When the last file descriptor opened for writing has been
closed, the largest untorn write granularity for that inode can be set
back down to zero.)

The write(2) system call will check whether the size and alignment of
the write are valid given the requested untorn write granularity.  And
in the writeback path, the writeback will detect if there are
contiguous (aligned) dirty pages, and make sure they are sent to the
storage device in multiples of the largest requested untorn write
granularity.  This provides only the guarantees required by databases,
and obviates the need to track which pages were dirtied by an
RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.

I'd like to discuss at LSF/MM what the best interface would be for
buffered, untorn writes (I am deliberately avoiding the use of the
word "atomic" since that presumes stronger guarantees than what we
need, and because it has led to confusion in previous discussions),
and what might be needed to support it.

						- Ted