Re: [PATCH v6 00/10] block atomic writes

John Garry <john.g.garry@xxxxxxxxxx> · Fri, 5 Apr 2024 11:06:00 +0100

On 04/04/2024 17:48, Matthew Wilcox wrote:
>>> The thing is that there's no requirement for an interface as complex as
>>> the one you're proposing here.  I've talked to a few database people
>>> and all they want is to increase the untorn write boundary from "one
>>> disc block" to one database block, typically 8kB or 16kB.
>>>
>>> So they would be quite happy with a much simpler interface where they
>>> set the inode block size at inode creation time,
>> We want to support untorn writes for bdev file operations - how can we set
>> the inode block size there? Currently it is based on logical block size.
> ioctl(BLKBSZSET), I guess?  That currently limits to PAGE_SIZE, but I
> think we can remove that limitation with the bs>PS patches.

We want a consistent interface for bdev and regular files, so that would
need to work for FSes also. FSes (XFS) work based on a homogeneous inode
blocksize, which is the SB blocksize.

Furthermore, we would seem to be mixing different concepts here. 
Currently in Linux we say that a logical block size write is atomic. In 
the block layer, we split BIOs on LBS boundaries. iomap creates BIOs 
based on LBS boundaries. But writing a FS block is not always guaranteed 
to be atomic, as far as I'm concerned. So just increasing the inode 
block size / FS block size does not really change anything, in itself.
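
To illustrate the distinction above, here is a rough sketch (not code from this series; the helper names are hypothetical) that counts how many logical blocks a write touches. Assuming only the "one logical block is atomic" rule, a 16kB FS block on a 4kB LBS device is four separately-atomic units, so nothing stops it from being torn between them:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Count the logical blocks touched by the byte range [off, off + len).
 * lbs is the logical block size in bytes and must be a power of two.
 * Hypothetical helper, for illustration only. */
static uint64_t lbs_units_touched(uint64_t off, uint64_t len, uint32_t lbs)
{
    assert(lbs != 0 && (lbs & (lbs - 1)) == 0);
    if (len == 0)
        return 0;
    return (off + len - 1) / lbs - off / lbs + 1;
}

/* Under the LBS rule alone, only a write that stays inside a single
 * logical block is implicitly untorn. */
static bool implicitly_untorn(uint64_t off, uint64_t len, uint32_t lbs)
{
    return lbs_units_touched(off, len, lbs) == 1;
}
```

With a 4096-byte LBS, lbs_units_touched(0, 16384, 4096) is 4, so implicitly_untorn() is false for a whole 16kB FS block regardless of what the inode block size is set to.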

>>> and then all writes to
>>> that inode were guaranteed to be untorn.  This would also be simpler to
>>> implement for buffered writes.
>> We did consider that. Won't that lead to the possibility of breaking
>> existing applications which want to do regular unaligned writes to these
>> files? We do know that mysql/innodb does have some "compressed" mode of
>> operation, which involves regular writes to the same file which wants untorn
>> writes.
> If you're talking about "regular unaligned buffered writes", then that
> won't break.  If you cross a folio boundary, the result may be torn,
> but if you're crossing a block boundary you expect that.

>> Furthermore, untorn writes in HW are expensive - for SCSI anyway. Do we
>> always want these for such a file?
> Do untorn writes actually exist in SCSI?  I was under the impression
> nobody had actually implemented them in SCSI hardware.

I know that some SCSI targets actually atomically write data in chunks > 
LBS. Obviously atomic vs non-atomic performance is a moot point there, 
as data is implicitly always atomically written.

We actually have an mysql/innodb port of this API working on such a SCSI 
target.

However I am not sure about atomic write support for other SCSI targets.

>> We saw untorn writes as not being a property of the file or even the inode
>> itself, but rather an attribute of the specific IO being issued from the
>> userspace application.
> The problem is that keeping track of that is expensive for buffered
> writes.  It's a model that only works for direct IO.  Arguably we
> could make it work for O_SYNC buffered IO, but that'll require some
> surgery.

To me, O_ATOMIC would be required for buffered atomic write IO, as we
want a fixed-size IO, so that would mean no mixing of atomic and
non-atomic IO.
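
For contrast, the per-IO model for direct IO discussed above might look roughly like the following from userspace. This is only a sketch under assumptions: it uses the RWF_ATOMIC flag to pwritev2() (the fallback value is the one in Linux 6.11+ uapi headers), the helper names are hypothetical, and unit_min/unit_max are assumed to come from whatever the kernel reports for the file. The assumed rules are that the length is a power of two within [unit_min, unit_max] and the offset is naturally aligned to the length:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040 /* assumption: value from Linux 6.11+ uapi */
#endif

/* Hypothetical pre-check: power-of-two length within the reported limits,
 * and an offset naturally aligned to that length. */
static bool atomic_write_ok(uint64_t off, uint64_t len,
                            uint64_t unit_min, uint64_t unit_max)
{
    if (len == 0 || (len & (len - 1)) != 0)
        return false;
    if (len < unit_min || len > unit_max)
        return false;
    return (off % len) == 0;
}

/* Hypothetical wrapper: issue one database block as a single untorn write.
 * fd is expected to be opened with O_DIRECT; a regular write to the same
 * fd simply omits RWF_ATOMIC, so atomic and non-atomic IO can be mixed. */
static ssize_t write_db_block_untorn(int fd, const void *buf, size_t len,
                                     off_t off, uint64_t unit_min,
                                     uint64_t unit_max)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

    if (off < 0 || !atomic_write_ok((uint64_t)off, len, unit_min, unit_max)) {
        errno = EINVAL;
        return -1;
    }
    return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}
```

The point of the sketch is that the atomicity request travels with each call rather than with the file, which is what makes the "compressed" mixed-write case workable for direct IO, and also what is hard to track for buffered IO.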

Thanks,
John