Re: [LSF/MM/BPF TOPIC] untorn buffered writes

"Theodore Ts'o" <tytso@xxxxxxx> · Wed, 28 Feb 2024 17:24:15 -0600

On Wed, Feb 28, 2024 at 04:06:43PM +0000, John Garry wrote:
> 
> Note that the initial RFC for my series did propose an interface that does
> allow a write to be split in the kernel on a boundary, and that boundary was
> evaluated on a per-write basis by the length and alignment of the write
> along with any extent alignment granularity.
> 
> We decided not to pursue that, and instead require a write per 16K page, for
> the example above.

Yes, I did see that.  And that leads to the problem where if you do an
RWF_ATOMIC write which is 32k, then we are promising that it will be
sent as a single 32k SCSI or NVMe request --- even though that isn't
required by the database, the API is *promising* that we will honor
it.  But that leads to the problem where for buffered writes, we need
to track which dirty pages are part of write #1, where we had promised
a 32k "atomic" write, which pages were part of writes #2, and #3,
which were each promised to be 16k "atomic writes", and which pages
were part of write #4, which was promised to be a 64k write.  If the
pages dirtied by writes #1, #2, and #3, and #4 are all contiguous, how
do we know what promise we had made about which pages should be
atomically sent together in a single write request?  Do we have to
store all of this information somewhere in the struct page or struct
folio?

And if we use Matthew's suggestion that we treat each folio as the
atomic write unit, does that mean that we have to break part or join
folios together depending on which writes were sent with an RWF_ATOMIC
write flag and by their size?

You see?  This is why I think the RWF_ATOMIC flag, which was mostly
harmless when it over-promised unneeded semantics for Direct I/O, is
actively harmful and problematic for buffered I/O.

> 
> If you check the latest discussion on XFS support we are proposing something
> along those lines:
> https://lore.kernel.org/linux-fsdevel/Zc1GwE%2F7QJisKZCX@xxxxxxxxxxxxxxxxxxx/
> 
> There FS_IOC_FSSETXATTR would be used to set extent size w/ fsx.fsx_extsize
> and new flag FS_XGLAG_FORCEALIGN to guarantee extent alignment, and this
> alignment would be the largest untorn write granularity.
> 
> Note that I already got push back on using fcntl for this.

There are two separable untorn write granularity that you might need to
set, One is specifying the constraints that must be required for all
block allocations associated with the file.  This needs to be
persistent, and stored with the file or directory (or for the entire
file system; I'll talk about this option in a moment) so that we know
that a particular file has blocks allocated in contiguous chunks with
the correct alignment so we can make the untorn write guarantee.
Since this needs to be persistent, and set when the file is first
created, that's why I could imagine that someone pushed back on using
fcntl(2) --- since fcntl is a property of the file descriptor, not of
the inode, and when you close the file descriptor, nothing that you
set via fcntl(2) is persisted.

However, the second untorn write granularity which is required for
writes using a particular file descriptor.  And please note that these
two values don't necessarily need to be the same.  For example, if the
first granularity is 32k, such that block allocations are done in 32k
clusters, aligned on 32k boundaries, then you can provide untorn write
guarantees of 8k, 16k, or 32k ---- so long as (a) the file or block
device has the appropriate alignment guarantees, and (b) the hardware
can support untorn write guarantees of the requested size.

And for some file systems, and for block devices, you might not need
to set the first untorn write granularity size at all.  For example,
if the block device represents the entire disk, or represents a
partition which is aligned on a 1MB boundary (which tends to be case
for GPT partitions IIRC), then we don't need to set any kind of magic
persistent granularity size, because it's a fundamental propert of the
partition.  As another example, ext4 has the bigalloc file system
feature, which allows you to set at file system creation time, a
cluster allocation size which is a power of two multiple of the
blocksize.  So for example, if you have a block size of 4k, and
block/cluster ratio is 16, then the cluster size is 64k, and all data
blocks will be done in aligned 64k chunks.

The ext4 bigalloc feature has been around since 2011, so it's
something that can be enabled even for a really ancient distro kernel.
:-) Hence, we don't actually *need* any file system format changes.
If there was a way that we could set a requeted untorn write
granularity size associated with all writes to a particular file
descriptor, via fcntl(2), that's all we actually need.  That is, we
just need the non-persistent, file descriptor-specific write
granularity parameter which applies to writes; and this would work for
raw block devices, where we wouldn't have any *place* to store file
attribute.  And like with ext4 bigalloc file systems, we don't need
any file system format changes in order to support untorn writes for
block devices, so long as the starting offset of the block device
(zero if it's the whole disk) is appropriately aligned.

Cheers,

						- Ted