On Sat, Jun 01, 2019 at 08:45:49AM +1000, Dave Chinner wrote:
> Given that we can already use AIO to provide this sort of ordering,
> and AIO is vastly faster than synchronous IO, I don't see any point
> in adding complex barrier interfaces that can be /easily implemented
> in userspace/ using existing AIO primitives. You should start
> thinking about expanding libaio with stuff like
> "link_after_fdatasync()" and suddenly the whole problem of
> filesystem data vs metadata ordering goes away because the
> application directly controls all ordering without blocking and
> doesn't need to care what the filesystem under it does....

And let me point out that this is also how userspace can do an
efficient atomic rename - rename_after_fdatasync(). i.e. on
completion of the AIO_FSYNC, run the rename. This guarantees that
the application will see either the old file or the complete new
file, and it *doesn't have to wait for the operation to complete*.
Once it is in flight, the file will contain the old data until some
point in the near future when it will contain the new data....

Seriously, sit down and work out all the "atomic" data vs metadata
behaviours you want, and then tell me how many of them cannot be
implemented as "AIO_FSYNC w/ completion callback function" in
userspace. This mechanism /guarantees ordering/ at the application
level, the application does not block waiting for these data
integrity operations to complete, and you don't need any new kernel
side functionality to implement this.

Fundamentally, the assertion that disk cache flushes are what cause
fsync "to be slow" is incorrect. It's the synchronous "waiting for
IO completion" that makes fsync "slow". AIO_FSYNC avoids needing to
wait for IO completion, allowing the application to do useful work
(like issue more DI ops) while data integrity operations are in
flight. At this point, fsync is no longer a "slow" operation - it's
just another background async data flush operation like the BDI
flusher thread...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
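
[A minimal sketch of the pattern described above - the userspace
equivalent of the hypothetical rename_after_fdatasync() - assuming
libaio and a filesystem that accepts IO_CMD_FDSYNC via io_submit();
the file names and build command are illustrative only:]

/*
 * Sketch only: queue an async fdatasync of a temp file and, on
 * completion of that AIO_FSYNC, run the rename.  Assumes the
 * filesystem supports AIO fsync via io_submit(); file names are
 * made up.  Build with: gcc -o aio_rename aio_rename.c -laio
 */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	int fd;

	fd = open("file.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0 || io_queue_init(1, &ctx) < 0)
		return 1;

	/* ... write the complete new file contents to fd here ... */

	/* Queue the async fdatasync of the temp file; don't wait. */
	io_prep_fdsync(&cb, fd);
	if (io_submit(ctx, 1, cbs) != 1)
		return 1;

	/*
	 * The application keeps doing useful work here while the
	 * data integrity operation is in flight.
	 */

	/* On completion of the AIO_FSYNC, run the rename. */
	if (io_getevents(ctx, 1, 1, &ev, NULL) == 1 && ev.res == 0)
		rename("file.tmp", "file");

	close(fd);
	io_queue_destroy(ctx);
	return 0;
}

[A real application would reap that completion from its event loop,
or via io_set_callback()/io_queue_run(), rather than blocking in
io_getevents(), so the rename still doesn't hold up other work.]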