Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

On Sat, Jun 01, 2019 at 08:45:49AM +1000, Dave Chinner wrote:
> Given that we can already use AIO to provide this sort of ordering,
> and AIO is vastly faster than synchronous IO, I don't see any point
> in adding complex barrier interfaces that can be /easily implemented
> in userspace/ using existing AIO primitives. You should start
> thinking about expanding libaio with stuff like
> "link_after_fdatasync()" and suddenly the whole problem of
> filesystem data vs metadata ordering goes away because the
> application directly controls all ordering without blocking and
> doesn't need to care what the filesystem under it does....

And let me point out that this is also how userspace can do an
efficient atomic rename - rename_after_fdatasync(). i.e. on
completion of the AIO_FSYNC, run the rename. This guarantees that
the application will see either the old file or the complete new
file, and it *doesn't have to wait for the operation to complete*.
Once it is in flight, the file will contain the old data until some
point in the near future when it will contain the new data....
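
To make the ordering concrete, here is a minimal sketch of what a
rename_after_fdatasync() helper could look like. It is an illustration
only: it uses a Python thread pool as a stand-in for a real AIO_FSYNC
submission, so it shows the "run the rename from the completion
callback" ordering rather than actual kernel AIO. The helper name comes
from the text above; its signature, the _pool object and demo() are
assumptions made for this example.

```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for an AIO context. A real implementation would submit an
# AIO fsync request instead of handing fdatasync() to a worker thread.
_pool = ThreadPoolExecutor(max_workers=1)


def rename_after_fdatasync(fd, tmp_path, final_path):
    """Queue fdatasync(fd); rename tmp_path over final_path on completion.

    Returns at once with a Future that resolves to final_path, or carries
    the exception if either the flush or the rename failed.
    """
    done = Future()

    def on_sync_complete(sync_future):
        try:
            sync_future.result()             # re-raises a failed fdatasync
            os.rename(tmp_path, final_path)  # publish only after the flush
            done.set_result(final_path)
        except Exception as exc:
            done.set_exception(exc)

    _pool.submit(os.fdatasync, fd).add_done_callback(on_sync_complete)
    return done


def demo():
    workdir = tempfile.mkdtemp()
    final_path = os.path.join(workdir, "config")
    tmp_path = final_path + ".tmp"

    with open(final_path, "w") as f:
        f.write("old")

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(fd, b"new")

    pending = rename_after_fdatasync(fd, tmp_path, final_path)
    # The caller is free to do other work here. Anyone opening
    # final_path sees either "old" or "new", never a partial file.
    pending.result()  # waited on here only so the demo can check the result
    os.close(fd)

    with open(final_path) as f:
        return f.read()
```

The caller only touches the returned Future if it cares when the rename
has happened. Note that the sketch orders the data flush before the
rename but does nothing to make the rename itself durable; that would
need a further sync of the containing directory.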

Seriously, sit down and work out all the "atomic" data vs metadata
behaviours you want, and then tell me how many of them cannot be
implemented as "AIO_FSYNC w/ completion callback function" in
userspace. This mechanism /guarantees ordering/ at the application
level, the application does not block waiting for these data
integrity operations to complete, and you don't need any new kernel
side functionality to implement this.

Fundamentally, the assertion that disk cache flushes are not what
causes fsync "to be slow" is incorrect. It's the synchronous
"waiting for IO completion" that makes fsync "slow". AIO_FSYNC
avoids needing to wait for IO completion, allowing the application
to do useful work (like issue more DI ops) while data integrity
operations are in flight. At this point, fsync is no longer a "slow"
operation - it's just another background async data flush operation
like the BDI flusher thread...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


