On 14/01/2025 23:57, Darrick J. Wong wrote:
On Tue, Jan 14, 2025 at 03:41:13PM +1100, Dave Chinner wrote:
On Wed, Dec 11, 2024 at 05:34:33PM -0800, Darrick J. Wong wrote:
On Fri, Dec 06, 2024 at 08:15:05AM +1100, Dave Chinner wrote:
On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote:
e.g. look at MySQL's use of fallocate(hole punch) for transparent
data compression - nobody had foreseen that hole punching would be
used like this, but it's a massive win for the applications which
store bulk compressible data in the database even though it does bad
things to the filesystem.
Spend some time looking outside the proprietary database application
box and think a little harder about the implications of atomic write
functionality. i.e. what happens when we have ubiquitous support
for guaranteeing only the old or the new data will be seen after
a crash *without the need for using fsync*.
IOWs, the program either wants an old version or a new version of the
files that it wrote, and the commit boundary is syncfs() after updating
all the files?
Yes, though there isn't a need for syncfs() to guarantee old-or-new.
That's the sort of thing an application can choose to do at the end
of its update set...
Well yes, there has to be a cache flush somewhere -- last I checked,
RWF_ATOMIC doesn't require that the written data be persisted after the
call completes.
Correct, RWF_ATOMIC | RWF_SYNC is required to guarantee persistence.
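A minimal sketch of that split, assuming a 6.11+ kernel with RWF_ATOMIC
support on the target file (the fallback define matches the uapi value;
the file name and sizes are made up): RWF_ATOMIC alone gives the
all-or-nothing tear guarantee, and either RWF_SYNC per write or one
syncfs() over the whole update set makes it durable.

/* untorn 16KiB overwrite; durability is deferred to a single syncfs() */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* uapi value, for older userspace headers */
#endif

int main(void)
{
	int fd = open("db.file", O_RDWR | O_DIRECT);	/* placeholder file */
	struct iovec iov;
	void *buf;

	if (fd < 0 || posix_memalign(&buf, 4096, 16384))
		return 1;
	memset(buf, 0xab, 16384);
	iov.iov_base = buf;
	iov.iov_len = 16384;

	/* old-or-new is guaranteed, but the data is not yet persistent */
	if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
		return 1;

	/* ... more RWF_ATOMIC overwrites of other files go here ... */

	/* one journal/cache flush for the whole update set */
	syncfs(fd);
	return 0;
}
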
Think about the implications of that for a minute - for any full
file overwrite up to the hardware atomic limits, we won't need fsync
to guarantee the integrity of overwritten data anymore. We only need
a mechanism to flush the journal and device caches once all the data
has been written (e.g. syncfs)...
"up to the hardware atomic limits" -- that's a big limitation. What if
I need to write 256K but the device only supports up to 64K? RWF_ATOMIC
won't work. Or what if the file range I want to dirty isn't aligned
with the atomic write alignment? What if the awu geometry changes
online due to a device change, how do programs detect that?
If awu geometry changes dynamically in an incompatible way, then
filesystem RWF_ATOMIC alignment guarantees are fundamentally broken.
This is not a problem the filesystem can solve.
IMO, RAID device hotplug should reject a replacement device whose atomic
write support is incompatible with the existing device set.
With that constraint, the whole mess of "awu can randomly change"
problems go away.
Assuming device mapper is subject to that too, I agree.
If a device which does not support atomic writes is added to an md raid
array, then atomic writes are disabled (for the block device). I need
to verify that hotplug behaves like this.
And dm does behave like this also, i.e. atomic writes are disabled for
the dm block device.
Programs that aren't 100% block-based should use exchange-range. There
are no alignment restrictions, no limits on the size you can exchange,
no file mapping state requirements to trip over, and you can update
arbitrary sparse ranges. As long as you don't tell exchange-range to
flush the log itself, programs can use syncfs to amortize the log and
cache flush across a bunch of file content exchanges.
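A hedged sketch of that pattern, assuming the XFS_IOC_EXCHANGE_RANGE
ioctl and struct xfs_exchange_range from the 6.10+ XFS uapi (here via
the xfsprogs <xfs/xfs.h> header); paths, offsets and sizes are made up.
The new data is staged in an O_TMPFILE, exchanged into place without
XFS_EXCHANGE_RANGE_DSYNC, and a later syncfs() pays the log/cache flush
once across any number of such exchanges:

/* stage an oversized, unaligned update in a temp file, then swap it in */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs.h>	/* xfsprogs header with XFS_IOC_EXCHANGE_RANGE */

int main(void)
{
	/* file and directory names are placeholders */
	int file2 = open("/mnt/data/records.db", O_RDWR);
	int file1 = open("/mnt/data", O_RDWR | O_TMPFILE, 0600);
	off_t off = 1234567;		/* deliberately unaligned offset */
	char newrec[300 * 1024];	/* deliberately > any hw atomic limit */
	struct xfs_exchange_range xchg = {
		.file1_fd	= file1,
		.file1_offset	= off,
		.file2_offset	= off,
		.length		= sizeof(newrec),
		.flags		= 0,	/* no DSYNC: let syncfs() flush later */
	};

	if (file1 < 0 || file2 < 0)
		return 1;
	memset(newrec, 0x5a, sizeof(newrec));
	if (pwrite(file1, newrec, sizeof(newrec), off) != sizeof(newrec))
		return 1;

	/* the ioctl is issued against file2; old-or-new is guaranteed */
	if (ioctl(file2, XFS_IOC_EXCHANGE_RANGE, &xchg) < 0)
		return 1;

	syncfs(file2);	/* one flush covers the whole batch of exchanges */
	return 0;
}
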
Right - that's kinda my point - I was assuming that we'd be using
something like xchg-range as the "unaligned slow path" for
RWF_ATOMIC.
i.e. RWF_ATOMIC as implemented by a COW capable filesystem should
always be able to succeed regardless of IO alignment. In these
situations, the REQ_ATOMIC block layer offload to the hardware is a
fast path that is enabled when the user IO and filesystem extent
alignment matches the constraints needed to do a hardware atomic
write.
In all other cases, we implement RWF_ATOMIC as something like
always-cow or prealloc-beyond-eof-then-xchg-range-on-io-completion
for anything that doesn't correctly align to hardware REQ_ATOMIC.
That said, there is nothing that prevents us from first implementing
RWF_ATOMIC constraints as "must match hardware requirements exactly"
and then relaxing them to be less stringent as filesystems
implementations improve. We've relaxed the direct IO hardware
alignment constraints multiple times over the years, so there's
nothing that really prevents us from doing so with RWF_ATOMIC,
either. Especially as we have statx to tell the application exactly
what alignment will get fast hardware offloads...
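For completeness, a sketch of that statx query, assuming 6.11+ uapi
headers that carry STATX_WRITE_ATOMIC and the stx_atomic_write_* fields
(it goes through syscall() to avoid depending on a libc wrapper that
knows about the new fields):

/* report the untorn write geometry statx advertises for a file */
#include <fcntl.h>
#include <linux/stat.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct statx stx = { 0 };

	if (argc < 2)
		return 1;
	if (syscall(SYS_statx, AT_FDCWD, argv[1], 0,
		    STATX_WRITE_ATOMIC, &stx) < 0) {
		perror("statx");
		return 1;
	}
	if (!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)) {
		printf("%s: no untorn write support\n", argv[1]);
		return 0;
	}
	printf("%s: awu min %u, awu max %u, max segments %u\n", argv[1],
	       stx.stx_atomic_write_unit_min,
	       stx.stx_atomic_write_unit_max,
	       stx.stx_atomic_write_segments_max);
	return 0;
}
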
Ok, let's do that then. Just to be clear -- for any RWF_ATOMIC direct
write that's correctly aligned and targets a single mapping in the
correct state, we can build the untorn bio and submit it. For
everything else, prealloc some post EOF blocks, write them there, and
exchange-range them.
That makes my life easier ... today, anyway.
For RWF_ATOMIC, our targeted users will want guaranteed performance, so
they would really need to know about anything that is doing
software-based atomic writes behind the scenes.
JFYI, I did rework the zeroing code to leverage what we already have in
iomap, and it looks better to me:
https://github.com/johnpgarry/linux/commits/atomic-write-large-atomics-v6.13-v4/
There is a problem with atomic writes over EOF, but that can be solved.
Tricky questions: How do we avoid collisions between overlapping writes?
I guess we find a free file range at the top of the file that is long
enough to stage the write, and put it there? And purge it later?
Also, does this imply that the maximum file size is less than the usual
8EB?
(There's also the question about how to do this with buffered writes,
but I guess we could skip that for now.)
Even better, if you still wanted to use untorn block writes to persist
the temporary file's dirty data to disk, you don't even need forcealign
because the exchange-range will take care of restarting the operation
during log recovery. I don't know that there's much point in doing that
but the idea is there.
*nod*