I observe that XFS takes an exclusive lock for DIO writes that are not
block aligned:
xfs_file_dio_aio_write(
{
        ...
        /*
         * Don't take the exclusive iolock here unless the I/O is unaligned to
         * the file system block size.  We don't need to consider the EOF
         * extension case here because xfs_file_aio_write_checks() will relock
         * the inode as necessary for EOF zeroing cases and fill out the new
         * inode size as appropriate.
         */
        if ((iocb->ki_pos & mp->m_blockmask) ||
            ((iocb->ki_pos + count) & mp->m_blockmask)) {
                unaligned_io = 1;

                /*
                 * We can't properly handle unaligned direct I/O to reflink
                 * files yet, as we can't unshare a partial block.
                 */
                if (xfs_is_cow_inode(ip)) {
                        trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
                                        count);
                        return -ENOTBLK;
                }
                iolock = XFS_IOLOCK_EXCL;
        } else {
                iolock = XFS_IOLOCK_SHARED;
        }
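For reference, with a 4k block size m_blockmask is 0xfff, so both the start
and the end of the write must fall on a 4096-byte boundary to keep the
shared lock. A small user-space illustration of the same test (the values
are made up for the example, not taken from a real mount):

        #include <stdio.h>

        int main(void)
        {
                unsigned long blockmask = 4096 - 1;  /* mp->m_blockmask, 4k blocks */
                struct { unsigned long pos, count; } w[] = {
                        { 4096, 4096 }, /* aligned: shared iolock, stays async */
                        { 4096,  512 }, /* unaligned end: exclusive, forced sync */
                        {  100, 4096 }, /* unaligned start: exclusive, forced sync */
                };

                for (int i = 0; i < 3; i++) {
                        int unaligned = (w[i].pos & blockmask) ||
                                        ((w[i].pos + w[i].count) & blockmask);
                        printf("pos=%lu count=%lu -> %s\n", w[i].pos, w[i].count,
                               unaligned ? "unaligned (EXCL)" : "aligned (SHARED)");
                }
                return 0;
        }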
I also see that such writes cause io_submit to block, even when they hit a
written extent (and, by implication, are not size-changing) and therefore
require no metadata update. This is probably due to the "|| unaligned_io" in

        ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
                           &xfs_dio_write_ops,
                           is_sync_kiocb(iocb) || unaligned_io);
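The blocking is easy to see from user space. Here is a minimal reproducer
with libaio, assuming a preallocated, fully written file on XFS with 4k
blocks (compile with -laio): the 512-byte write below is sector-aligned but
not fs-block-aligned, and io_submit itself takes roughly the device write
latency instead of returning immediately:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <libaio.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        int main(void)
        {
                io_context_t ctx = 0;
                struct iocb cb, *cbs[1] = { &cb };
                struct io_event ev;
                struct timespec t0, t1;
                void *buf;
                int fd = open("testfile", O_RDWR | O_DIRECT);

                if (fd < 0 || io_setup(1, &ctx) < 0 ||
                    posix_memalign(&buf, 4096, 4096))
                        return 1;

                /* offset 512, length 512: sector-aligned, fs-block-unaligned */
                io_prep_pwrite(&cb, fd, buf, 512, 512);

                clock_gettime(CLOCK_MONOTONIC, &t0);
                io_submit(ctx, 1, cbs);         /* blocks here for unaligned DIO */
                clock_gettime(CLOCK_MONOTONIC, &t1);

                printf("io_submit took %ld us\n",
                       (t1.tv_sec - t0.tv_sec) * 1000000 +
                       (t1.tv_nsec - t0.tv_nsec) / 1000);

                io_getevents(ctx, 1, 1, &ev, NULL);
                return 0;
        }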
Can this be relaxed to allow writes to written extents to proceed in
parallel? I explain the motivation below.
My thinking (from a position of blissful ignorance) is that if the extent is
already written, then no metadata change or block zeroing is needed. If we
can detect that these favorable conditions exist (perhaps with the extra
constraint that the mapping already be cached), then we can handle this
particular case asynchronously, as sketched below.
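Concretely, something along these lines; xfs_range_is_written() is a helper
name I invented for illustration, and I don't know whether such a check can
be made cheap or fully safe:

        if ((iocb->ki_pos & mp->m_blockmask) ||
            ((iocb->ki_pos + count) & mp->m_blockmask)) {
                /*
                 * Hypothetical: if the whole range maps to written,
                 * non-shared extents inside EOF, no zeroing or unwritten
                 * extent conversion is needed, so we could keep the
                 * shared iolock and skip forcing the dio synchronous.
                 */
                if (!xfs_range_is_written(ip, iocb->ki_pos, count))
                        unaligned_io = 1;
                ...
        }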
My motivation is a database commit log. NVMe drives can serve small writes
with ridiculously low latency - around 20 microseconds. Let's say a
commitlog entry is around 100 bytes; then we fill a 4k block with about 41
entries. Filling a block within those 20 microseconds requires 2 million
records/sec. Even if we add artificial delay and commit every 1ms, filling
this 4k block requires 41,000 commits/sec. If the entry write rate is lower,
then we are forced to pad the rest of the block. This increases the write
amplification, impacting other activities using the disk (such as reads).
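The arithmetic above, spelled out (assuming exactly 4096-byte blocks and
100-byte entries):

        #include <stdio.h>

        int main(void)
        {
                double entries_per_block = 4096.0 / 100.0;      /* ~41 */
                printf("entries per block:  %.0f\n", entries_per_block);
                printf("fill in 20us needs: %.1fM entries/s\n",
                       entries_per_block / 20e-6 / 1e6);        /* ~2.0M */
                printf("fill in 1ms needs:  %.0fK entries/s\n",
                       entries_per_block / 1e-3 / 1e3);         /* 41K */
                return 0;
        }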
41,000 commits/sec may not sound like much, but in a thread-per-core design
(where each core commits independently) this translates to millions of
commits per second for the entire machine. If the real throughput is below
that, we are forced either to increase the latency, collecting more writes
into a full block, or to tolerate the increased write amplification.