Disk aligned (but not block aligned) DIO write woes

I observe that XFS takes an exclusive lock for DIO writes that are not block aligned:


xfs_file_dio_aio_write(
        struct kiocb            *iocb,
        struct iov_iter         *from)
{
        ...

        /*
         * Don't take the exclusive iolock here unless the I/O is unaligned to
         * the file system block size.  We don't need to consider the EOF
         * extension case here because xfs_file_aio_write_checks() will relock
         * the inode as necessary for EOF zeroing cases and fill out the new
         * inode size as appropriate.
         */
        if ((iocb->ki_pos & mp->m_blockmask) ||
            ((iocb->ki_pos + count) & mp->m_blockmask)) {
                unaligned_io = 1;

                /*
                 * We can't properly handle unaligned direct I/O to reflink
                 * files yet, as we can't unshare a partial block.
                 */
                if (xfs_is_cow_inode(ip)) {
                        trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
                        return -ENOTBLK;
                }
                iolock = XFS_IOLOCK_EXCL;
        } else {
                iolock = XFS_IOLOCK_SHARED;
        }
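
For context, the check above means a DIO write stays on the shared iolock only if both its start and end offsets are multiples of the filesystem block size. A userspace mirror of that test might look like the sketch below (my own illustration, not from XFS; it assumes st_blksize reports the filesystem block size, which is the usual case on XFS):

#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

/* Sketch: true if a DIO write of `count` bytes at `pos` would pass the
 * kernel alignment test above, i.e. both ends are block aligned. */
static bool dio_write_is_block_aligned(int fd, off_t pos, size_t count)
{
        struct stat st;

        if (fstat(fd, &st) < 0)
                return false;

        return (pos % st.st_blksize) == 0 &&
               ((pos + (off_t)count) % st.st_blksize) == 0;
}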


I also see that such writes cause io_submit to block, even when they hit a written extent (and, by implication, are not size-changing) and therefore do not require a metadata write. This is probably due to the "|| unaligned_io" in:


        ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
                           &xfs_dio_write_ops,
                           is_sync_kiocb(iocb) || unaligned_io);
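
As a rough way to demonstrate the blocking, a sketch along these lines (my own reproducer idea, not something from the kernel tree or the original code above) could time io_submit() for a sector-aligned but non-block-aligned DIO write into an already written region of an XFS file. /tmp/testfile is a placeholder and must be pre-filled with written (not preallocated) data, and the device's logical sector size is assumed to be 512 bytes:

/* Build with: gcc -O2 -o dio_unaligned dio_unaligned.c -laio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        struct timespec t0, t1;
        void *buf;
        int fd;

        fd = open("/tmp/testfile", O_WRONLY | O_DIRECT);
        if (fd < 0 || io_setup(1, &ctx))
                return 1;

        /* 512 bytes at offset 512: sector aligned, but not aligned to the
         * (typically 4k) filesystem block size, so it takes the exclusive
         * iolock and the synchronous iomap_dio_rw() path. */
        if (posix_memalign(&buf, 4096, 512))
                return 1;
        memset(buf, 'x', 512);
        io_prep_pwrite(&cb, fd, buf, 512, 512);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (io_submit(ctx, 1, cbs) != 1)
                return 1;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* If io_submit blocked until I/O completion, this is roughly the
         * device write latency instead of a few microseconds. */
        printf("io_submit took %ld us\n",
               (t1.tv_sec - t0.tv_sec) * 1000000 +
               (t1.tv_nsec - t0.tv_nsec) / 1000);

        io_getevents(ctx, 1, 1, &ev, NULL);
        io_destroy(ctx);
        close(fd);
        return 0;
}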


Can this be relaxed to allow writes to written extents to proceed in parallel? I explain the motivation below.


My thinking (from a position of blissful ignorance) is that if the extent is already written, then no metadata changes or block zeroing are needed. If we can detect that these favorable conditions exist (perhaps with the extra constraint that the mapping already be cached), then we can handle this particular case asynchronously.
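
To illustrate the condition I mean (a written extent, as opposed to a hole or an unwritten/preallocated one), here is a userspace sketch using FIEMAP. It is only an illustration of the distinction; presumably the kernel would detect this from its own (ideally already cached) extent mapping rather than anything like this:

#include <linux/fiemap.h>
#include <linux/fs.h>
#include <string.h>
#include <sys/ioctl.h>

/* Sketch: returns 1 if the byte at `offset` in `fd` is backed by a
 * written extent, 0 if it is a hole, unwritten/preallocated, or on error. */
static int block_is_written(int fd, __u64 offset)
{
        char raw[sizeof(struct fiemap) + sizeof(struct fiemap_extent)];
        struct fiemap *fm = (struct fiemap *)raw;

        memset(raw, 0, sizeof(raw));
        fm->fm_start = offset;
        fm->fm_length = 1;
        fm->fm_flags = FIEMAP_FLAG_SYNC;
        fm->fm_extent_count = 1;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0 || fm->fm_mapped_extents == 0)
                return 0;

        return !(fm->fm_extents[0].fe_flags & FIEMAP_EXTENT_UNWRITTEN);
}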


My motivation is a database commit log. NVMe drives can serve small writes with ridiculously low latency - around 20 microseconds. Let's say a commitlog entry is around 100 bytes; that means about 41 entries fill a 4k block. Filling that block within the 20-microsecond device latency requires around 2 million entries/sec. Even if we add an artificial delay and commit every 1ms, filling a 4k block still requires 41,000 commits/sec. If the entry rate is lower than that, we are forced to pad the rest of the block, which increases write amplification and impacts other activities using the disk (such as reads).


41,000 commits/sec may not sound like much, but in a thread-per-core design (where each core commits independently) it translates to millions of commits per second for the entire machine. If the real throughput is below that, we are forced either to increase the latency so we can collect more writes into a full block, or to tolerate the increased write amplification.




