Re: [PATCH v6 00/10] block atomic writes

On 27/03/2024 03:50, Matthew Wilcox wrote:
> On Tue, Mar 26, 2024 at 01:38:03PM +0000, John Garry wrote:
>> The goal here is to provide an interface that allows applications to
>> use application-specific block sizes larger than the logical block size
>> reported by the storage device or larger than the filesystem block size
>> as reported by stat().
>>
>> With this new interface, application blocks will never be torn or
>> fractured when written. In the event of a power failure, for each
>> individual application block, all or none of the data will be written.
>> A racing atomic write and read will mean that the read sees all the old
>> data or all the new data, but never a mix of old and new.
>>
>> Three new fields are added to struct statx - atomic_write_unit_min,
>> atomic_write_unit_max, and atomic_write_segments_max. For each
>> individual atomic write, the total length of the write must be between
>> atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
>> power-of-2. The write must also be at a naturally aligned offset in the
>> file with respect to the write length. For pwritev2, iovcnt is limited
>> by atomic_write_segments_max.
>>
>> There has been some discussion on supporting buffered IO and whether
>> the API is suitable, like:
>> https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@xxxxxxxxxxxxxxxxxxxx/
>>
>> Specifically, the concern is that supporting a range of atomic IO sizes
>> in the pagecache is complex. For this, my idea is that FSes can fix
>> atomic_write_unit_min and atomic_write_unit_max at the same size, the
>> extent alignment size, which should be easier to support. We may need
>> to implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for
>> this. I have no proposed solution for atomic write buffered IO for bdev
>> file operations, but I know of no requirement for this.
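
For concreteness, this is roughly how an application would consume the
new fields and issue one untorn write. It is only a sketch: the constant
and field names (STATX_WRITE_ATOMIC, stx_atomic_write_unit_*,
stx_atomic_write_segments_max, RWF_ATOMIC) are the ones used in this
series and assume userspace headers that already carry them, and atomic
writes here go through direct IO.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct statx stx;
	struct iovec iov;
	void *buf;
	size_t len;
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_RDWR | O_DIRECT);
	if (fd < 0 ||
	    statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) < 0)
		return 1;

	printf("unit_min=%u unit_max=%u segments_max=%u\n",
	       stx.stx_atomic_write_unit_min,
	       stx.stx_atomic_write_unit_max,
	       stx.stx_atomic_write_segments_max);

	/* unit_min is always a valid (power-of-2) atomic write size */
	len = stx.stx_atomic_write_unit_min;
	if (!len || posix_memalign(&buf, len, len))
		return 1;
	memset(buf, 0xab, len);

	iov.iov_base = buf;
	iov.iov_len = len;

	/* offset 0 is naturally aligned for any power-of-2 length */
	if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) != (ssize_t)len)
		perror("pwritev2");

	return 0;
}
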

> The thing is that there's no requirement for an interface as complex as
> the one you're proposing here.  I've talked to a few database people
> and all they want is to increase the untorn write boundary from "one
> disc block" to one database block, typically 8kB or 16kB.
>
> So they would be quite happy with a much simpler interface where they
> set the inode block size at inode creation time,

We want to support untorn writes for bdev file operations - how can we set the inode block size there? Currently it is based on the logical block size.

> and then all writes to
> that inode were guaranteed to be untorn.  This would also be simpler to
> implement for buffered writes.

We did consider that. Won't that lead to the possibility of breaking existing applications that want to do regular unaligned writes to these files? We do know that mysql/innodb has a "compressed" mode of operation, which involves regular writes to the same file that also wants untorn writes.

Furthermore, untorn writes in HW are expensive - for SCSI anyway. Do we always want these for such a file?

We saw untorn writes as not being a property of the file or even the inode itself, but rather an attribute of the specific IO being issued from the userspace application.


> Who's asking for this more complex interface?

It's not a case of someone specifically asking for this interface. This is just a proposal to satisfy the userspace requirement for untorn writes in a generic way.

From a user point-of-view, untorn writes for a regular file can be enabled up to a specific size* with the FS_IOC_SETFLAGS API. The user then needs to follow alignment and size rules when issuing untorn writes, but they would always need to do this. In addition, the user may still issue regular (tearable) writes to the file.

* I think that we could change this to only allow writes for that specific size, which was my proposal for buffered IO.
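
To make those alignment and size rules concrete, the per-write check
amounts to something like the following sketch (the limits are the
statx values quoted above; the helper name is made up for illustration):

#include <stdbool.h>
#include <stdint.h>

/*
 * A write may be issued untorn when its total length is a power-of-2
 * between atomic_write_unit_min and atomic_write_unit_max, its file
 * offset is naturally aligned to that length, and the iovec count does
 * not exceed atomic_write_segments_max.
 */
static bool atomic_write_ok(uint64_t offset, uint64_t len,
			    unsigned int iovcnt, uint32_t unit_min,
			    uint32_t unit_max, uint32_t segments_max)
{
	if (len < unit_min || len > unit_max)
		return false;
	if (len & (len - 1))		/* not a power-of-2 */
		return false;
	if (offset & (len - 1))		/* not naturally aligned */
		return false;
	return iovcnt <= segments_max;
}
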

Thanks,
John




