On Wed, Feb 28, 2024 at 01:38:44PM +0200, Amir Goldstein wrote:
> > Seems a duplicate of this topic proposed by Luis?
> > https://lore.kernel.org/linux-fsdevel/ZdfDxN26VOFaT_Tv@xxxxxxxxxxxxxxxxxxxxxx/

Maybe.  I did see Luis's topic, but it seemed to me to be largely
orthogonal to what I was interested in talking about.  Maybe I'm
missing something, but my observations were largely similar to Dave
Chinner's comments here:

https://lore.kernel.org/r/ZdvXAn1Q%2F+QX5sPQ@xxxxxxxxxxxxxxxxxxx/

To wit, there are two cases here: either the desired untorn write
granularity is smaller than the large block size, in which case there
is really nothing that needs to be done from an API perspective; or
the desired untorn granularity is *larger* than the large block size,
in which case the API considerations are the same with or without LBS
support.

From the implementation perspective, yes, there is a certain amount of
commonality, but that to me is relatively trivial --- or at least, it
isn't a particularly subtle design.  That is, the writeback code needs
to know what the desired write granularity is, whether it is required
by the device because the logical sector size is larger than the page
size, or because an untorn write granularity was requested by the
userspace process doing the writing (in practice, pretty much always
16k for databases).  In terms of what the writeback code needs to do,
it needs to make sure that it gathers up pages respecting the
alignment and required size, and if a page is locked, it has to wait
until the page is available, instead of skipping it as it would for a
non-data-integrity writeback.

As far as tooling/testing is concerned, again, it appears to me that
the requirements of LBS and the desire for untorn writes in units of
granularity larger than the block size are quite orthogonal.  For LBS,
all you need is some kind of synthetic/debug device which has a
logical block size larger than the page size.  This could be done a
number of ways:

* via the VMM --- e.g., a QEMU block device that has a 64k logical
  sector size,

* via a loop device that exports a larger logical sector size, or

* via blktrace (or its ebpf or ftrace equivalent), making sure that
  the size of every write request is the right multiple of 512-byte
  sectors.

For testing untorn writes, life is a bit trickier, because not all
writes will be larger than the page size.  For example, we might have
an ext4 file system with a 4k block size, so metadata writes to the
inode table, etc., will be 4k writes.  However, when writing to the
database file, *those* writes need to be in multiples of 16k, with 16k
alignment required, and if a write needs to be broken up, it must be
split at a 16k boundary.

The tooling for this, which is untorn-write specific and completely
irrelevant for the LBS case, needs to know which parts of the storage
device are assigned to the database file --- and which are not.  If
the database file is not getting deleted or truncated, it's relatively
easy to take a blktrace (or its ebpf or ftrace equivalent) and
validate all of the I/O's after the fact.  The tooling to do this
isn't terribly complicated; it would involve using filefrag -v if the
file system is already mounted, and a file-system-specific tool (e.g.,
debugfs for ext4, or xfs_db for XFS) if the file system is not
mounted.

Cheers,

						- Ted
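
P.S.  Since I claimed the tooling "isn't terribly complicated", here
is a rough, untested sketch of the mounted-file-system case, just to
show the basic shape.  It's in Python, the script name is made up, and
the 16k granularity / 4k file system block size / 512-byte trace
sectors are hard-coded assumptions: it pulls the file's extent map
with filefrag -v, then scans blkparse output for issued writes that
overlap the file but aren't 16k-aligned (and 16k-sized) relative to
the start of the file.

#!/usr/bin/env python3
#
# validate-untorn.py --- a rough sketch, not a real tool.
#
# Given a database file on a *mounted* file system and the text output
# of blkparse for the underlying device, flag any issued write which
# overlaps the file's extents but is not aligned to (and a multiple
# of) the untorn write granularity, measured from the start of the
# file.  Assumes the file is not deleted, truncated, or moved while
# the trace is taken.

import re
import subprocess
import sys

GRANULARITY = 16384     # required untorn write granularity, in bytes
FS_BLOCK = 4096         # file system block size, in bytes
SECTOR = 512            # blktrace sector size, in bytes

def file_extents(path):
    """Return the file's extent map as (logical, physical, length)
    tuples, all in bytes, using filefrag -v."""
    out = subprocess.run(["filefrag", "-v", path], capture_output=True,
                         text=True, check=True).stdout
    extents = []
    # filefrag -v extent lines look like:
    #    0:        0..      63:      34816..     34879:     64: flags
    for m in re.finditer(r"^\s*\d+:\s+(\d+)\.\.\s*\d+:"
                         r"\s+(\d+)\.\.\s*\d+:\s+(\d+):", out, re.M):
        logical, physical, length = (int(g) for g in m.groups())
        extents.append((logical * FS_BLOCK, physical * FS_BLOCK,
                        length * FS_BLOCK))
    return extents

def check_trace(extents, trace):
    ok = True
    # Match issued writes ("D" actions) in blkparse's default output,
    # e.g.:  259,0  3  1  0.000000000  1234  D  WS 34816 + 32 [fio]
    pat = re.compile(r"\sD\s+W\S*\s+(\d+)\s+\+\s+(\d+)")
    for line in trace:
        m = pat.search(line)
        if not m:
            continue                    # not an issued write
        start = int(m.group(1)) * SECTOR
        length = int(m.group(2)) * SECTOR
        for logical, physical, ext_len in extents:
            lo = max(start, physical)   # overlap with this extent
            hi = min(start + length, physical + ext_len)
            if lo >= hi:
                continue                # write doesn't touch the file
            file_off = logical + (lo - physical)
            if file_off % GRANULARITY or (hi - lo) % GRANULARITY:
                print("TORN? file offset %d, length %d: %s"
                      % (file_off, hi - lo, line.strip()))
                ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_trace(file_extents(sys.argv[1]),
                              sys.stdin) else 1)

You'd run it as something like:

    blkparse -i tracefile | python3 validate-untorn.py /path/to/db.file

A real tool would want to merge logically contiguous extents (so a
write spanning an extent boundary isn't flagged spuriously) and handle
the unmounted case via debugfs or xfs_db, but that's the basic shape.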