Re: [LSF/MM/BPF TOPIC] untorn buffered writes

On 28/02/2024 06:12, Theodore Ts'o wrote:
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgres database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enabled, and let's
> suppose they are written using the proposed RWF_ATOMIC flag.   In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time.   This very quickly becomes a mess.

Having done some research, I see that Postgres uses a fixed "page" size per file, typically 8KB, which is configured at compile time. The page size may differ between certain file types, but it is possible to configure all file types with the same page size. This all seems like standard DB stuff.

So, as I mentioned in response to Matthew here:
https://lore.kernel.org/linux-scsi/47d264c2-bc97-4313-bce0-737557312106@xxxxxxxxxx/

.. for untorn buffered write support, we could just set atomic_write_unit_min = atomic_write_unit_max = FS file alignment granule = DB page size. That would seem easier to support in the page cache while still providing the RWF_ATOMIC guarantee. For ext4, the bigalloc cluster size could serve as this FS file alignment granule. For XFS, it would be the extsize with forcealign.
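
As a rough illustration of that constraint (not code from any posted series), the check below models what a single-granule policy would accept. It assumes the same rules as the direct IO atomic write series: length is a power of two, lies within [unit_min, unit_max], and the file offset is naturally aligned to the length. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file limits; field names mirror the proposed statx fields. */
struct untorn_limits {
	uint32_t unit_min;	/* bytes */
	uint32_t unit_max;	/* bytes */
};

static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Proposal: unit_min == unit_max == FS alignment granule == DB page size. */
static struct untorn_limits limits_from_granule(uint32_t granule)
{
	assert(is_pow2(granule));	/* assumed: granule is a power of two */
	return (struct untorn_limits){ .unit_min = granule, .unit_max = granule };
}

/* Would a write of len bytes at offset pos qualify as untorn? */
static bool untorn_write_ok(const struct untorn_limits *lim,
			    uint64_t pos, uint64_t len)
{
	if (!is_pow2(len))
		return false;
	if (len < lim->unit_min || len > lim->unit_max)
		return false;
	return (pos & (len - 1)) == 0;	/* naturally aligned to len */
}
```

With an 8KB granule, only 8KB writes at 8KB-aligned offsets pass, so the page cache would have a single fixed unit to track rather than a mix of write sizes.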

It might be argued that we would like to submit larger untorn write IOs from userspace for a performance benefit and allow the kernel to split them on some page boundary, but I doubt that userspace would make use of this. On the other hand, the block atomic writes kernel series does support block layer merging (of atomic writes).

As for advertising untorn buffered write capability, the current statx fields update for atomic writes is here:
https://lore.kernel.org/linux-api/20240124112731.28579-2-john.g.garry@xxxxxxxxxx/

Only direct IO support is mentioned there. To support buffered IO, I suppose an additional flag could be added to report buffered IO info, like STATX_ATTR_WRITE_ATOMIC_BUFFERED, reusing the atomic_write_unit_{min, max, segments_max} fields for buffered IO. The direct IO and buffered IO flags would be mutually exclusive.
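
A minimal sketch of how userspace might interpret such a reply, purely as a model: the struct below is a stand-in for the relevant part of struct statx, the MODEL_ATTR_* bit values are placeholders, and the buffered flag does not exist in any kernel. It only shows the mutual-exclusion rule and the shared use of the unit fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits; real statx attribute values would differ. */
#define MODEL_ATTR_WRITE_ATOMIC			(1u << 0)	/* direct IO */
#define MODEL_ATTR_WRITE_ATOMIC_BUFFERED	(1u << 1)	/* hypothetical */

/* Stand-in for the atomic write portion of a statx reply. */
struct model_statx {
	uint64_t attributes;
	uint32_t atomic_write_unit_min;
	uint32_t atomic_write_unit_max;
	uint32_t atomic_write_segments_max;
};

enum untorn_mode {
	UNTORN_NONE,		/* no untorn write support advertised */
	UNTORN_DIRECT,		/* unit fields describe direct IO */
	UNTORN_BUFFERED,	/* unit fields describe buffered IO */
	UNTORN_INVALID,		/* both flags set: not allowed in this model */
};

/* Decide which IO path the shared unit fields apply to. */
static enum untorn_mode untorn_mode_of(const struct model_statx *stx)
{
	bool direct = (stx->attributes & MODEL_ATTR_WRITE_ATOMIC) != 0;
	bool buffered = (stx->attributes & MODEL_ATTR_WRITE_ATOMIC_BUFFERED) != 0;

	if (direct && buffered)
		return UNTORN_INVALID;
	if (direct)
		return UNTORN_DIRECT;
	if (buffered)
		return UNTORN_BUFFERED;
	return UNTORN_NONE;
}
```

Because the same fields are reused, a caller would need to check the mode first and only then read the unit limits.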

Is there any anticipated problem with this idea?

On another topic, there is some development work to allow Postgres to use direct IO, see:
https://wiki.postgresql.org/wiki/AIO

Assuming all the info there is accurate and up to date, that work still seems to be lagging behind kernel untorn write support.

John



