On 28/02/2024 06:12, Theodore Ts'o wrote:
Last year, I talked about an interest to provide database such as
MySQL with the ability to issue writes that would not be torn as they
write 16k database pages[1].
[1] https://urldefense.com/v3/__https://lwn.net/Articles/932900/__;!!ACWV5N9M2RV99hQ!Ij_ZeSZrJ4uPL94Im73udLMjqpkcZwHmuNnznogL68ehu6TDTXqbMsC4xLUqh18hq2Ib77p1D8_4mV5Q$
There is a patch set being worked on by John Garry which provides
stronger guarantees than what is actually required for this use case,
called "atomic writes". The proposed interface for this facility
involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
that the specific write be written to the storage device in an
all-or-nothing fashion, and if it can not be guaranteed, that the
write should fail. In this interface, if the userspace sends an 128k
write with the RWF_ATOMIC flag, if the storage device will support
that an all-or-nothing write with the given size and alignment the
kernel will guarantee that it will be sent as a single 128k request
--- although from the database perspective, if it is using 16k
database pages, it only needs to guarantee that if the write is torn,
it only happen on a 16k boundary. That is, if the write is split into
32k and 96k request, that would be totally fine as far as the database
is concerned --- and so the RWF_ATOMIC interface is a stronger
guarantee than what might be needed.
Note that the initial RFC for my series did propose an interface that
does allow a write to be split in the kernel on a boundary, and that
boundary was evaluated on a per-write basis by the length and alignment
of the write along with any extent alignment granularity.
We decided not to pursue that, and instead require a write per 16K page,
for the example above.
So far, the "atomic write" patchset has only focused on Direct I/O,
where this stronger guarantee is mostly harmless, even if it is
unneeded for the original motivating use case. Which might be OK,
since perhaps there might be other future use cases where they might
want some 32k writes to be "atomic", while other 128k writes might
want to be "atomic" (that is to say, persisted with all-or-nothing
semantics), and the proposed RWF_ATOMIC interface might permit that
--- even though no one can seem top come up with a credible use case
that would require this.
However, this proposed interface is highly problematic when it comes
to buffered writes, and Postgress database uses buffered, not direct
I/O writes. Suppose the database performs a 16k write, followed by a
64k write, followed by a 128k write --- and these writes are done
using a file descriptor that does not have O_DIRECT enable, and let's
suppose they are written using the proposed RWF_ATOMIC flag. In
order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
kernel would need to store the fact that certain pages in the page
cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
writeback code knows what the "atomic" guarantee that was made at
write time. This very quickly becomes a mess. >
Another interface that one be much simpler to implement for buffered
writes would be one the untorn write granularity is set on a per-file
descriptor basis, using fcntl(2). We validate whether the untorn
write granularity is one that can be supported when fcntl(2) is
called, and we also store in the inode the largest untorn write
granularity that has been requested by a file descriptor for that
inode. (When the last file descriptor opened for writing has been
closed, the largest untorn write granularity for that inode can be set
back down to zero.)
If you check the latest discussion on XFS support we are proposing
something along those lines:
https://lore.kernel.org/linux-fsdevel/Zc1GwE%2F7QJisKZCX@xxxxxxxxxxxxxxxxxxx/
There FS_IOC_FSSETXATTR would be used to set extent size w/
fsx.fsx_extsize and new flag FS_XGLAG_FORCEALIGN to guarantee extent
alignment, and this alignment would be the largest untorn write granularity.
Note that I already got push back on using fcntl for this.
So whether you are more interested in ext4 and how ext4 can adopt that
API is another matter.. but I did consider adding something like struct
inode.i_blkbits for this untorn write granularity, so an FS would just
need to set that. But I am not proposing that ATM.
The write(2) system call will check whether the size and alignment of
the write are valid given the requested untorn write granularity. And
in the writeback path, the writeback will detect if there are
contiguous (aligned) dirty pages, and make sure they are sent to the
storage device in multiples of the largest requested untorn write
granularity. This provides only the guarantees required by databases,
and obviates the need to track which pages were dirtied by an
RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.
I'd like to discuss at LSF/MM what the best interface would be for
buffered, untorn writes (I am deliberately avoiding the use of the
word "atomic" since that presumes stronger guarantees than what we
need, and because it has led to confusion in previous discussions),
and what might be needed to support it.
Thanks,
John