Re: Disk aligned (but not block aligned) DIO write woes

On Mon, Dec 28, 2020 at 05:57:29PM +0200, Avi Kivity wrote:
> I observe that XFS takes an exclusive lock for DIO writes that are not block
> aligned:
> 
> 
> xfs_file_dio_aio_write(
> {
> ...
>         /*
>          * Don't take the exclusive iolock here unless the I/O is unaligned to
>          * the file system block size.  We don't need to consider the EOF
>          * extension case here because xfs_file_aio_write_checks() will relock
>          * the inode as necessary for EOF zeroing cases and fill out the new
>          * inode size as appropriate.
>          */
>         if ((iocb->ki_pos & mp->m_blockmask) ||
>             ((iocb->ki_pos + count) & mp->m_blockmask)) {
>                 unaligned_io = 1;
> 
>                 /*
>                  * We can't properly handle unaligned direct I/O to reflink
>                  * files yet, as we can't unshare a partial block.
>                  */
>                 if (xfs_is_cow_inode(ip)) {
>                         trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
>                         return -ENOTBLK;
>                 }
>                 iolock = XFS_IOLOCK_EXCL;
>         } else {
>                 iolock = XFS_IOLOCK_SHARED;
>         }
> 
> 
> I also see that such writes cause io_submit to block, even if they hit a
> written extent (and are also not size-changing, by implication) and
> therefore do not require a metadata write. This is probably due to the
> "|| unaligned_io" in:
> 
> 
>         ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
>                            &xfs_dio_write_ops,
>                            is_sync_kiocb(iocb) || unaligned_io);
> 
> 
> Can this be relaxed to allow writes to written extents to proceed in
> parallel? I explain the motivation below.
> 

From the above code, it looks as though we require the exclusive lock to
accommodate sub-block zeroing that might occur down in the dio code.
Without it, concurrent sector-granular I/Os to the same block could
presumably cause corruption if the data bios and associated zeroing bios
were reordered between two requests.

Looking down in the iomap/dio code, iomap_dio_bio_actor() triggers
sub-block zeroing if the associated range is unaligned and the mapping
is either unwritten or new (i.e. newly allocated for this write). So as
you note above, there is no such zeroing for a pre-existing written
block within EOF. From that, it seems reasonable to me that in principle
the filesystem could check for these conditions in the higher layer and
further restrict usage of the exclusive lock. That said, we'd probably
have to rework the above code to acquire the shared lock first, read and
check the extent state of the start/end blocks of the I/O, and then
cycle out to the exclusive lock in the cases where it is required (a
rough sketch of what I mean is below). It's not immediately clear to me
whether we'd still want to set the wait flag in those
unaligned-but-no-zeroing cases. Perhaps somebody who knows the dio code
a bit better can chime in further on all of this...
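
Completely untested and hand-waved, but something like the following is
roughly what I have in mind. xfs_dio_write_needs_zeroing() is a made-up
helper that would look up the mappings for the first and last blocks of
the I/O (e.g. via xfs_bmapi_read() under the shared ilock) and return
true unless both are already-written extents within EOF. The reflink and
IOCB_NOWAIT cases are ignored here:

        if ((iocb->ki_pos & mp->m_blockmask) ||
            ((iocb->ki_pos + count) & mp->m_blockmask)) {
                unaligned_io = 1;

                /*
                 * Start with the shared iolock and only cycle out to the
                 * exclusive iolock if sub-block zeroing could actually
                 * occur for this range, i.e. either end of the I/O lands
                 * in a block that isn't already a written extent within
                 * EOF.
                 */
                iolock = XFS_IOLOCK_SHARED;
                xfs_ilock(ip, iolock);
                if (xfs_dio_write_needs_zeroing(ip, iocb->ki_pos, count)) {
                        xfs_iunlock(ip, iolock);
                        iolock = XFS_IOLOCK_EXCL;
                        xfs_ilock(ip, iolock);
                }
        } else {
                iolock = XFS_IOLOCK_SHARED;
                xfs_ilock(ip, iolock);
        }

Again, just a sketch to show the lock cycling; whatever form the check
takes, it would presumably need the ilock to look at the extent map.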

Brian

> 
> My thinking (from a position of blissful ignorance) is that if the extent is
> already written, then no metadata changes or block zeroing are needed. If
> we can detect that these favorable conditions exist (perhaps with the extra
> constraint that the mapping already be cached), then we can handle this
> particular case asynchronously.
> 
> 
> My motivation is a database commit log. NVMe drives can serve small writes
> with ridiculously low latency - around 20 microseconds. Let's say a
> commitlog entry is around 100 bytes; we fill a 4k block with 41 entries. To
> achieve that in 20 microseconds requires 2 million records/sec. Even if we
> add artificial delay and commit every 1ms, filling this 4k block requires
> 41,000 commits/sec. If the entry write rate is lower, then we will be forced
> to pad the rest of the block. This increases the write amplification,
> impacting other activities using the disk (such as reads).
> 
> 
> 41,000 commits/sec may not sound like much, but in a thread-per-core design
> (where each core commits independently) this translates to millions of
> commits per second for the entire machine. If the real throughput is below
> that, then we are forced to either increase the latency to collect more
> writes into a full block, or we have to tolerate the increased write
> amplification.
> 
> 



