I observe that XFS takes an exclusive lock for DIO writes that are not
block aligned:
xfs_file_dio_aio_write(
{
        ...
        /*
         * Don't take the exclusive iolock here unless the I/O is unaligned to
         * the file system block size.  We don't need to consider the EOF
         * extension case here because xfs_file_aio_write_checks() will relock
         * the inode as necessary for EOF zeroing cases and fill out the new
         * inode size as appropriate.
         */
        if ((iocb->ki_pos & mp->m_blockmask) ||
            ((iocb->ki_pos + count) & mp->m_blockmask)) {
                unaligned_io = 1;

                /*
                 * We can't properly handle unaligned direct I/O to reflink
                 * files yet, as we can't unshare a partial block.
                 */
                if (xfs_is_cow_inode(ip)) {
                        trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
                                        count);
                        return -ENOTBLK;
                }
                iolock = XFS_IOLOCK_EXCL;
        } else {
                iolock = XFS_IOLOCK_SHARED;
        }
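For reference, with a 4k block size m_blockmask is 0xfff, so both the start
and the end of the write must fall on a 4096-byte boundary to keep the
shared lock. A small user-space illustration of the same test (the values
are made up for the example, not taken from a real mount):

        #include <stdio.h>

        int main(void)
        {
                unsigned long blockmask = 4096 - 1;  /* mp->m_blockmask, 4k blocks */
                struct { unsigned long pos, count; } w[] = {
                        { 4096, 4096 }, /* aligned: shared iolock, stays async */
                        { 4096,  512 }, /* unaligned end: exclusive, forced sync */
                        {  100, 4096 }, /* unaligned start: exclusive, forced sync */
                };

                for (int i = 0; i < 3; i++) {
                        int unaligned = (w[i].pos & blockmask) ||
                                        ((w[i].pos + w[i].count) & blockmask);
                        printf("pos=%lu count=%lu -> %s\n", w[i].pos, w[i].count,
                               unaligned ? "unaligned (EXCL)" : "aligned (SHARED)");
                }
                return 0;
        }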
I also see that such writes cause io_submit to block, even when they hit a
written extent (and, by implication, are not size-changing) and therefore
require no metadata update. This is probably due to the "|| unaligned_io" in

        ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
                           &xfs_dio_write_ops,
                           is_sync_kiocb(iocb) || unaligned_io);
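The blocking is easy to see from user space. Here is a minimal reproducer
with libaio, assuming a preallocated, fully written file on XFS with 4k
blocks (compile with -laio): the 512-byte write below is sector-aligned but
not fs-block-aligned, and io_submit itself takes roughly the device write
latency instead of returning immediately:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <libaio.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        int main(void)
        {
                io_context_t ctx = 0;
                struct iocb cb, *cbs[1] = { &cb };
                struct io_event ev;
                struct timespec t0, t1;
                void *buf;
                int fd = open("testfile", O_RDWR | O_DIRECT);

                if (fd < 0 || io_setup(1, &ctx) < 0 ||
                    posix_memalign(&buf, 4096, 4096))
                        return 1;

                /* offset 512, length 512: sector-aligned, fs-block-unaligned */
                io_prep_pwrite(&cb, fd, buf, 512, 512);

                clock_gettime(CLOCK_MONOTONIC, &t0);
                io_submit(ctx, 1, cbs);         /* blocks here for unaligned DIO */
                clock_gettime(CLOCK_MONOTONIC, &t1);

                printf("io_submit took %ld us\n",
                       (t1.tv_sec - t0.tv_sec) * 1000000 +
                       (t1.tv_nsec - t0.tv_nsec) / 1000);

                io_getevents(ctx, 1, 1, &ev, NULL);
                return 0;
        }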
Can this be relaxed to allow writes to written extents to proceed in
parallel? I explain the motivation below.
My thinking (from a position of blissful ignorance) is that if the extent is
already written, then no metadata change or block zeroing is needed. If we
can detect that these favorable conditions exist (perhaps with the extra
constraint that the mapping already be cached), then we can handle this
particular case asynchronously, as sketched below.
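Concretely, something along these lines; xfs_range_is_written() is a helper
name I invented for illustration, and I don't know whether such a check can
be made cheap or fully safe:

        if ((iocb->ki_pos & mp->m_blockmask) ||
            ((iocb->ki_pos + count) & mp->m_blockmask)) {
                /*
                 * Hypothetical: if the whole range maps to written,
                 * non-shared extents inside EOF, no zeroing or unwritten
                 * extent conversion is needed, so we could keep the
                 * shared iolock and skip forcing the dio synchronous.
                 */
                if (!xfs_range_is_written(ip, iocb->ki_pos, count))
                        unaligned_io = 1;
                ...
        }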
My motivation is a database commit log. NVMe drives can serve small writes
with ridiculously low latency - around 20 microseconds. Let's say a
commitlog entry is around 100 bytes; then we fill a 4k block with about 41
entries. Filling a block within those 20 microseconds requires 2 million
records/sec. Even if we add artificial delay and commit every 1ms, filling
this 4k block requires 41,000 commits/sec. If the entry write rate is lower,
then we are forced to pad the rest of the block. This increases the write
amplification, impacting other activities using the disk (such as reads).
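The arithmetic above, spelled out (assuming exactly 4096-byte blocks and
100-byte entries):

        #include <stdio.h>

        int main(void)
        {
                double entries_per_block = 4096.0 / 100.0;      /* ~41 */
                printf("entries per block:  %.0f\n", entries_per_block);
                printf("fill in 20us needs: %.1fM entries/s\n",
                       entries_per_block / 20e-6 / 1e6);        /* ~2.0M */
                printf("fill in 1ms needs:  %.0fK entries/s\n",
                       entries_per_block / 1e-3 / 1e3);         /* 41K */
                return 0;
        }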
41,000 commits/sec may not sound like much, but in a thread-per-core design
(where each core commits independently) this translates to millions of
commits per second for the entire machine. If the real throughput is below
that, we are forced either to increase the latency, collecting more writes
into a full block, or to tolerate the increased write amplification.