Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 14 Jan 2025 15:41:13 +1100

On Wed, Dec 11, 2024 at 05:34:33PM -0800, Darrick J. Wong wrote:
> On Fri, Dec 06, 2024 at 08:15:05AM +1100, Dave Chinner wrote:
> > On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote:
> > e.g. look at MySQL's use of fallocate(hole punch) for transparent
> > data compression - nobody had foreseen that hole punching would be
> > used like this, but it's a massive win for the applications which
> > store bulk compressible data in the database even though it does bad
> > things to the filesystem.
> > 
> > Spend some time looking outside the proprietary database application
> > box and think a little harder about the implications of atomic write
> > functionality.  i.e. what happens when we have ubiquitous support
> > for guaranteeing only the old or the new data will be seen after
> > a crash *without the need for using fsync*.
> 
> IOWs, the program either wants an old version or a new version of the
> files that it wrote, and the commit boundary is syncfs() after updating
> all the files?

Yes, though there isn't a need for syncfs() to guarantee old-or-new.
That's the sort of thing an application can choose to do at the end
of its update set...
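
As a rough sketch of that pattern: write every file in the update set,
then issue one syncfs() as the commit point. The struct and function names
here are hypothetical, and the sketch assumes all files live on the same
filesystem. rwf_flags is passed straight through to pwritev2(); an
application relying on untorn writes would pass RWF_ATOMIC there, which
needs kernel and filesystem support (and, as far as I know, direct IO).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* One pending overwrite in the update set (hypothetical type). */
struct update {
	int fd;			/* file opened for writing */
	const void *buf;	/* new contents */
	size_t len;		/* bytes to write */
	off_t off;		/* file offset of the overwrite */
};

/*
 * Write every update, then flush once with syncfs().
 * No per-file fsync()/fdatasync() is issued.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int commit_update_set(const struct update *u, size_t n, int rwf_flags)
{
	for (size_t i = 0; i < n; i++) {
		struct iovec iov = {
			.iov_base = (void *)u[i].buf,
			.iov_len = u[i].len,
		};
		ssize_t ret = pwritev2(u[i].fd, &iov, 1, u[i].off, rwf_flags);

		if (ret < 0)
			return -1;
		if ((size_t)ret != u[i].len) {
			/* Treat a short write as a failure of the whole set. */
			errno = EIO;
			return -1;
		}
	}

	if (n == 0)
		return 0;

	/* Assumption: every fd is on the same filesystem as u[0].fd. */
	return syncfs(u[0].fd);
}
```

With rwf_flags == 0 this is just a batch of ordinary overwrites followed
by one flush; the old-or-new guarantee only exists if each write is
actually atomic.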

> > Think about the implications of that for a minute - for any full
> > file overwrite up to the hardware atomic limits, we won't need fsync
> > to guarantee the integrity of overwritten data anymore. We only need
> > a mechanism to flush the journal and device caches once all the data
> > has been written (e.g. syncfs)...
> 
> "up to the hardware atomic limits" -- that's a big limitation.  What if
> I need to write 256K but the device only supports up to 64k?  RWF_ATOMIC
> won't work.  Or what if the file range I want to dirty isn't aligned
> with the atomic write alignment?  What if the awu geometry changes
> online due to a device change, how do programs detect that?

If awu geometry changes dynamically in an incompatible way, then
filesystem RWF_ATOMIC alignment guarantees are fundamentally broken.
This is not a problem the filesystem can solve.

IMO, RAID device hotplug should reject any replacement device whose
atomic write support is incompatible with the existing device set.
With that constraint, the whole mess of "awu can randomly change"
problems goes away.

> Programs that aren't 100% block-based should use exchange-range.  There
> are no alignment restrictions, no limits on the size you can exchange,
> no file mapping state requirements to trip over, and you can update
> arbitrary sparse ranges.  As long as you don't tell exchange-range to
> flush the log itself, programs can use syncfs to amortize the log and
> cache flush across a bunch of file content exchanges.

Right - that's kinda my point - I was assuming that we'd be using
something like xchg-range as the "unaligned slow path" for
RWF_ATOMIC.

i.e. RWF_ATOMIC as implemented by a COW capable filesystem should
always be able to succeed regardless of IO alignment. In these
situations, the REQ_ATOMIC block layer offload to the hardware is a
fast path that is enabled when the user IO and filesystem extent
alignment matches the constraints needed to do a hardware atomic
write.

In all other cases, i.e. for anything that doesn't correctly align to
hardware REQ_ATOMIC, we implement RWF_ATOMIC with something like
always-cow or prealloc-beyond-eof-then-xchg-range-on-io-completion.
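
To make the fast path / slow path split concrete, here is a minimal
userspace sketch of the decision. This is not kernel code: the types, the
function name and the single-extent flag are all hypothetical. It assumes
the hardware path needs a power-of-two length within the atomic write unit
limits, a naturally aligned file offset, and a naturally aligned backing
extent; everything else takes the software fallback.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the device's atomic write geometry. */
struct awu_limits {
	uint64_t unit_min;	/* smallest atomic write in bytes */
	uint64_t unit_max;	/* largest atomic write in bytes */
};

enum atomic_path {
	ATOMIC_PATH_HW,		/* REQ_ATOMIC offload to the hardware */
	ATOMIC_PATH_COW,	/* software fallback (COW / exchange-range style) */
};

static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Pick a path for an atomic write of len bytes at file offset pos.
 * disk_addr is the byte address on the device that backs pos, and
 * single_extent says whether one contiguous extent maps the whole range.
 */
static enum atomic_path pick_atomic_path(const struct awu_limits *lim,
					 uint64_t pos, uint64_t len,
					 uint64_t disk_addr, bool single_extent)
{
	if (!single_extent)
		return ATOMIC_PATH_COW;
	if (!is_pow2(len) || len < lim->unit_min || len > lim->unit_max)
		return ATOMIC_PATH_COW;
	if (pos & (len - 1))		/* user IO not naturally aligned */
		return ATOMIC_PATH_COW;
	if (disk_addr & (len - 1))	/* extent not naturally aligned */
		return ATOMIC_PATH_COW;
	return ATOMIC_PATH_HW;
}
```

The point of the sketch is that the fallback never returns an error to the
caller: a 256k write on a device limited to 64k just takes the slow path.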

That said, there is nothing that prevents us from first implementing
RWF_ATOMIC constraints as "must match hardware requirements exactly"
and then relaxing them to be less stringent as filesystem
implementations improve. We've relaxed the direct IO hardware
alignment constraints multiple times over the years, so there's
nothing that really prevents us from doing so with RWF_ATOMIC,
either. Especially as we have statx to tell the application exactly
what alignment will get fast hardware offloads...

> Even better, if you still wanted to use untorn block writes to persist
> the temporary file's dirty data to disk, you don't even need forcealign
> because the exchange-range will take care of restarting the operation
> during log recovery.  I don't know that there's much point in doing that
> but the idea is there.

*nod*

> > Want to overwrite a bunch of small files safely?  Atomic write the
> > new data, then syncfs(). There's no need to run fdatasync after each
> > write to ensure individual files are not corrupted if we crash in
> > the middle of the operation. Indeed, atomic writes actually provide
> > better overwrite integrity semantics than fdatasync as it will be
> > all or nothing. fdatasync does not provide that guarantee if we
> > crash during the fdatasync operation.
> > 
> > Further, with COW data filesystems like XFS, btrfs and bcachefs, we
> > can emulate atomic writes for any size larger than what the hardware
> > supports.
> > 
> > At this point we actually provide app developers with what they've
> > been repeatedly asking kernel filesystem engineers to provide them
> > for the past 20 years: a way of overwriting arbitrary file data
> > safely without needing an expensive fdatasync operation on every
> > file that gets modified.
> > 
> > Put simply: atomic writes have a huge potential to fundamentally
> > change the way applications interact with Linux filesystems and to
> > make it *much* simpler for applications to safely overwrite user
> > data.  Hence there is an imperative here to make the foundational
> > support for this technology solid and robust because atomic writes
> > are going to be with us for the next few decades...
> 
> I agree that we need to make the interface solid and robust, but I don't
> agree that the current RWF_ATOMIC, with its block-oriented storage
> device quirks is the way to go here.
> 
> Maybe a byte-oriented RWF_ATOMIC
> would work, but the only way I can think of to do that is (say) someone
> implements Christoph's suggestion to change the COW code to allow
> multiple writes to a staging extent, and only commit the remapping
> operations at sync time... and you'd still have problems if you have to
> do multiple remappings if there's not also a way to restart the ioend
> chains.
> 
> Exchange-range already solved all of that, and it's already merged.

Yes, I agree that the block-device quirks need to go away from
RWF_ATOMIC, but I think it's the right interface for applications
that want to use atomic overwrite semantics.

Hiding exchange-range under the XFS covers for unaligned atomic IO
would mean applications won't need to target XFS-specific ioctls to
do reliable atomic overwrites. i.e. the API really needs to be
simple and filesystem independent, and RWF_ATOMIC gives us that...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx