On 28/02/2024 06:12, Theodore Ts'o wrote:
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and the Postgres database uses buffered, not direct
> I/O writes. Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enabled, and let's
> suppose they are written using the proposed RWF_ATOMIC flag. In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what "atomic" guarantee was made at
> write time. This very quickly becomes a mess.
Having done some research, postgres has a fixed "page" size per file and
this is typically 8KB. This is configured at compile time. Page size may
be different between certain file types, but it is possible to have all
file types be configured for the same page size. This all seems like
standard DB stuff.
So, as I mentioned in response to Matthew here:
https://lore.kernel.org/linux-scsi/47d264c2-bc97-4313-bce0-737557312106@xxxxxxxxxx/
... for untorn buffered write support, we could just set
atomic_write_unit_min = atomic_write_unit_max = FS file alignment
granule = DB page size. That would seem easier to support in the page
cache and still provide the RWF_ATOMIC guarantee. For ext4, bigalloc
cluster size could be this FS file alignment granule. For XFS, it would
be the extsize with forcealign.
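A minimal sketch of what the min = max = granule rule would mean for a
single buffered write, assuming the granule is a power of two. The helper
names are hypothetical and not taken from the posted series; this only
illustrates the size and alignment check, not any page cache handling.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper: true if n is a non-zero power of two. */
static bool ex_is_power_of_2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical check for an untorn buffered write when
 * atomic_write_unit_min == atomic_write_unit_max == granule
 * (e.g. the DB page size). The write must cover exactly one
 * granule and start on a granule boundary.
 */
static bool ex_untorn_buffered_write_ok(off_t pos, size_t len, size_t granule)
{
	if (!ex_is_power_of_2(granule))
		return false;	/* assumption: granule is a power of two */
	if (len != granule)
		return false;	/* min == max, so only one size is allowed */
	if (pos < 0)
		return false;
	return ((unsigned long long)pos & (granule - 1)) == 0;
}
```

With an 8KB granule, an 8KB write at offset 16384 would pass, while an
8KB write at offset 4096 or a 16KB write at offset 0 would be rejected.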
It might be argued that we would like to submit larger untorn write IOs
from userspace for performance benefit and allow the kernel to split on
some page boundary, but I doubt that this will be utilised by userspace.
On the other hand, the block atomic writes kernel series does support
block layer merging (of atomic writes).
About advertising untorn buffered write capability, the current statx
fields update for atomic writes is here:
https://lore.kernel.org/linux-api/20240124112731.28579-2-john.g.garry@xxxxxxxxxx/
Only direct IO support is mentioned there. To support buffered IO, I
suppose an additional flag could be added for reporting buffered IO info,
like STATX_ATTR_WRITE_ATOMIC_BUFFERED, and the atomic_write_{unit_min,
unit_max, segments_max} fields could be reused for buffered IO. The
direct IO and buffered IO flags would be mutually exclusive.
Is there any anticipated problem with this idea?
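A rough sketch of how userspace might interpret such flags, assuming the
two attribute bits are mutually exclusive and share the same unit fields.
Everything here is a local stand-in: the struct, the EX_ names and the bit
values are made up for illustration, and STATX_ATTR_WRITE_ATOMIC_BUFFERED
is only the flag suggested above, not an existing kernel definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Made-up bit values, for illustration only. */
#define EX_ATTR_WRITE_ATOMIC          (1ULL << 0)	/* direct IO */
#define EX_ATTR_WRITE_ATOMIC_BUFFERED (1ULL << 1)	/* buffered IO (hypothetical) */

/* Local stand-in for the statx fields under discussion. */
struct ex_atomic_info {
	uint64_t attributes;
	uint32_t atomic_write_unit_min;
	uint32_t atomic_write_unit_max;
	uint32_t atomic_write_segments_max;
};

enum ex_atomic_mode {
	EX_ATOMIC_NONE,		/* no untorn write support advertised */
	EX_ATOMIC_DIRECT,	/* unit fields describe direct IO */
	EX_ATOMIC_BUFFERED,	/* unit fields describe buffered IO */
	EX_ATOMIC_INVALID,	/* both flags set: not allowed in this scheme */
};

/* Decide which IO mode the shared unit fields describe. */
static enum ex_atomic_mode ex_atomic_get_mode(const struct ex_atomic_info *info)
{
	bool dio = (info->attributes & EX_ATTR_WRITE_ATOMIC) != 0;
	bool buffered = (info->attributes & EX_ATTR_WRITE_ATOMIC_BUFFERED) != 0;

	if (dio && buffered)
		return EX_ATOMIC_INVALID;
	if (dio)
		return EX_ATOMIC_DIRECT;
	if (buffered)
		return EX_ATOMIC_BUFFERED;
	return EX_ATOMIC_NONE;
}
```

Because the flags are exclusive, a caller never has to guess which IO mode
the reused unit fields apply to.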
On another topic, there is some development to allow postgres to use
direct IO, see:
https://wiki.postgresql.org/wiki/AIO
Assuming all info there is accurate and up to date, it does still seem
to be lagging behind kernel untorn write support.
John