Re: Question about non asynchronous aio calls.

On 10/08/2015 02:46 PM, Dave Chinner wrote:
On Thu, Oct 08, 2015 at 11:23:07AM +0300, Gleb Natapov wrote:
On Thu, Oct 08, 2015 at 08:21:58AM +0300, Avi Kivity wrote:
I fixed something similar in ext4 at the time, FWIW.
Makes sense.

Is there a way to relax this for reads?
The above mostly only applies to writes. Reads don't modify data so
racing unaligned reads against other reads won't give unexpected
results and so aren't serialised.

i.e. serialisation will only occur when:
	- unaligned write IO will serialise until sub-block zeroing
	  is complete.
	- write IO extending EOF will serialise until post-EOF
	  zeroing is complete
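
To make the first rule concrete: a submission path can check block
alignment before issuing the write. This is only an illustrative sketch
(the helper name is ours, not from this thread), and how the block size
is obtained is left to the caller; see the statfs() discussion further
down.

/* Illustrative helper: a DIO write whose offset and length are both
 * multiples of the filesystem block size should not hit the sub-block
 * zeroing path described above. */
#include <stdbool.h>
#include <sys/types.h>

static bool dio_write_is_block_aligned(off_t offset, size_t len, long blksize)
{
	return blksize > 0 &&
	       (offset % blksize) == 0 &&
	       (len % (size_t)blksize) == 0;
}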

By "complete" here, do you mean that a call to truncate() returned, or that
its results reached the disk an unknown time later?

No, I'm talking purely about DIO here. If you do a write that
starts beyond the existing EOF, there is a region between the
current EOF and the offset the write starts at. i.e.

    0             EOF            offset     new EOF
    +dddddddddddddd+..............+nnnnnnnnnnn+

It is the region between EOF and offset that we must ensure is made
up of either holes, unwritten extents or fully zeroed blocks before
allowing the write to proceed. If we have to zero allocated blocks,
then we have to ensure that completes before the write can start.
This means that when we update the EOF on completion of the write,
we don't expose stale data in blocks that were between EOF and
offset...

Thanks. We found, experimentally, that io_submit(write_at_eof) followed immediately (without waiting for it to complete) by io_submit(write_at_what_would_be_the_new_eof) occasionally blocks.

So I guess we have to employ a train algorithm here and keep at most one aio in flight for append loads (which are very common for us).
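
Roughly like this, presumably (a sketch only; the structure and helper
names below are ours, error handling is trimmed, and the buffer must
already satisfy the usual O_DIRECT alignment; link with -laio):

/* At most one append AIO in flight per file; the next append is only
 * submitted after the previous one completes and EOF has moved, so a
 * write extending EOF never races the previous one's post-EOF handling. */
#include <libaio.h>
#include <string.h>
#include <sys/types.h>

struct append_train {
	io_context_t	ctx;
	int		fd;
	off_t		eof;		/* offset of the next append */
	int		in_flight;	/* 0 or 1 by construction */
	struct iocb	iocb;
};

static int train_init(struct append_train *t, int fd, off_t eof)
{
	memset(t, 0, sizeof(*t));
	t->fd = fd;
	t->eof = eof;
	return io_setup(1, &t->ctx);	/* one in-flight request is enough */
}

/* Submit one append iff nothing is currently in flight; otherwise the
 * caller queues the buffer and retries after train_reap(). */
static int train_submit(struct append_train *t, void *buf, size_t len)
{
	struct iocb *ios[1] = { &t->iocb };

	if (t->in_flight)
		return -1;
	io_prep_pwrite(&t->iocb, t->fd, buf, len, t->eof);
	if (io_submit(t->ctx, 1, ios) != 1)
		return -1;
	t->in_flight = 1;
	return 0;
}

/* Wait for the in-flight append; only then may the next one be submitted. */
static int train_reap(struct append_train *t)
{
	struct io_event ev;

	if (!t->in_flight)
		return 0;
	if (io_getevents(t->ctx, 1, 1, &ev, NULL) != 1)
		return -1;
	t->eof += ev.res;		/* bytes written on success */
	t->in_flight = 0;
	return 0;
}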


I think Brian already answered that one with:

   There are no such pitfalls as far as I'm aware. The entire AIO
   submission synchronization sequence triggers off an in-memory i_size
   check in xfs_file_aio_write_checks(). The in-memory i_size is updated in
   the truncate path (xfs_setattr_size()) via truncate_setsize(), so at
   that point the new size should be visible to subsequent AIO writers.
Different situation, as truncate serialises all IO. Extending the file
via truncate also runs the same "EOF zeroing" that the DIO code runs
above, for the same reasons.

Does that mean that truncate() will wait for inflight aios, or that new aios will wait for the truncate() to complete, or both?


	- truncate/extent manipulation syscall is run
Actually, we do call fallocate() ahead of io_submit() (in a worker thread,
in non-overlapping ranges) to optimize file layout and also in the belief
that it would reduce the amount of blocking io_submit() does.
fallocate serialises all IO submission - including reads. Unlike
truncate, however, it doesn't drain the queue of IO for
preallocation so the impact on AIO is somewhat limited.

Ideally you want to limit fallocate calls to large chunks at a time.
If you have a 1:1 mapping of fallocate calls to write calls, then
you're likely making things worse for the AIO submission path
because you'll block reads as well as writes. Doing the allocation
in the write submission path will not block reads, and only writes
that are attempting to do concurrent allocations to the same file
will serialise...
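
In code, that advice might look something like this (a sketch only; the
chunk size and the use of FALLOC_FL_KEEP_SIZE are our assumptions, not
something stated in this thread):

/* Chunked preallocation ahead of the append stream: one fallocate()
 * call covers many subsequent AIO writes instead of a 1:1 mapping.
 * FALLOC_FL_KEEP_SIZE leaves EOF alone, so the space shows up as
 * unwritten extents rather than visible file size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

#define PREALLOC_CHUNK	(64 * 1024 * 1024)	/* e.g. 64 MiB per call */

static int maybe_prealloc(int fd, off_t next_write_off, off_t *prealloc_end)
{
	if (next_write_off + PREALLOC_CHUNK / 2 < *prealloc_end)
		return 0;			/* plenty left */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, *prealloc_end, PREALLOC_CHUNK))
		return -1;
	*prealloc_end += PREALLOC_CHUNK;
	return 0;
}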

We have a 1:8 ratio (128K:1M), but that's just random numbers we guessed.

Again, not only for reduced XFS metadata, but also to reduce the amount of write amplification done by the FTL. We have a concurrent append workload on many files, and files are reclaimed out of order, so larger extents mean less fragmentation for the FTL later on.


If you want to limit fragmentation without adding overhead on
XFS for non-sparse files (which sounds like your case), then the
best thing to use in XFS is the per-inode extent size hints. You set
it on the file when first creating it (or the parent directory so
all children inherit it at create), and then the allocator will
round out allocations to the size hint alignment and size, including
beyond EOF so appending writes can take advantage of it....

We'll try that out.  That's fsxattr::fsx_extsize?
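
Presumably something along these lines (a sketch assuming the xfsprogs
headers; the 1 MiB value below is just an example, and on a directory
the inherit behaviour comes from XFS_XFLAG_EXTSZINHERIT instead):

/* Set the per-inode extent size hint via the XFS FSSETXATTR ioctl. */
#include <sys/ioctl.h>
#include <xfs/xfs.h>

static int set_extsize_hint(int fd, unsigned int bytes)
{
	struct fsxattr fsx;

	if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx))
		return -1;
	fsx.fsx_extsize = bytes;		/* multiple of the fs block size */
	fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;	/* enable the hint */
	return ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
}

/* e.g. set_extsize_hint(fd, 1024 * 1024); */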

What about small files that are eventually closed, do I need to do anything to reclaim the preallocated space?


A final point is discoverability.  There is no way to discover safe
alignment for reads and writes, and which operations block io_submit(),
except by asking here, which cannot be done at runtime.  Interfaces that
provide a way to query these attributes are very important to us.
As Brian pointed out, statfs() can be used to get f_bsize, which is defined as
"optimal transfer block size".
Well, that's what posix calls it. It's not really the optimal IO
size, though, it's just the IO size that avoids page cache RMW
cycles. For direct IO, larger tends to be better, and IO aligned to
the underlying geometry of the storage is even better. See, for
example, the "largeio" mount option, which will make XFS report the
stripe width in f_bsize rather than the PAGE_SIZE of the machine....
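
The runtime query itself is trivial (minimal sketch; f_bsize semantics
as per statfs(2)):

/* Query the filesystem's reported IO block size for an open file. With
 * XFS mounted with "largeio" this is the stripe width rather than
 * PAGE_SIZE, as noted above. */
#include <sys/vfs.h>

static long fs_io_block_size(int fd)
{
	struct statfs sfs;

	if (fstatfs(fd, &sfs) != 0)
		return -1;
	return (long)sfs.f_bsize;	/* "optimal transfer block size" */
}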


Well, random reads will still be faster with 512-byte alignment, yes? And for random writes, you can't just make those I/Os larger; you'll overwrite something.

So I read "optimal" here to mean "smallest I/O size that doesn't incur a penalty; but if you really need more data, making it larger will help".
