Re: Disk aligned (but not block aligned) DIO write woes

Avi Kivity <avi@xxxxxxxxxxxx> · Mon, 4 Jan 2021 17:16:48 +0200

On 04/01/2021 17.06, Brian Foster wrote:
On Mon, Dec 28, 2020 at 05:57:29PM +0200, Avi Kivity wrote:
I observe that XFS takes an exclusive lock for DIO writes that are not block
aligned:

xfs_file_dio_aio_write(

{

...

        /*
          * Don't take the exclusive iolock here unless the I/O is unaligned
to
          * the file system block size.  We don't need to consider the EOF
          * extension case here because xfs_file_aio_write_checks() will
relock
          * the inode as necessary for EOF zeroing cases and fill out the new
          * inode size as appropriate.
          */
         if ((iocb->ki_pos & mp->m_blockmask) ||
             ((iocb->ki_pos + count) & mp->m_blockmask)) {
                 unaligned_io = 1;

                 /*
                  * We can't properly handle unaligned direct I/O to reflink
                  * files yet, as we can't unshare a partial block.
                  */
                 if (xfs_is_cow_inode(ip)) {
                         trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
count);
                         return -ENOTBLK;
                 }
                 iolock = XFS_IOLOCK_EXCL;
         } else {
                 iolock = XFS_IOLOCK_SHARED;
         }

I also see that such writes cause io_submit to block, even if they hit a
written extent (and are also not size-changing, by implication) and
therefore do not require a metadata write. Probably due to "|| unaligned_io"
in

         ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
                            &xfs_dio_write_ops,
                            is_sync_kiocb(iocb) || unaligned_io);

Can this be relaxed to allow writes to written extents to proceed in
parallel? I explain the motivation below.

 From the above code, it looks as though we require the exclusive lock to
accommodate sub-block zeroing that might occur down in the dio code.
Without it, concurrent sector granular I/Os to the same block could
presumably cause corruption if the data bios and associated zeroing bios
were reordered between two requests.

Looking down in the iomap/dio code, iomap_dio_bio_actor() triggers
sub-block zeroing if the associated range is unaligned and the mapping
is either unwritten or new (i.e. newly allocated for this write). So as
you note above, there is no such zeroing for a pre-existing written
block within EOF. From that, it seems reasonable to me that in principle
the filesystem could check for these conditions in the higher layer and
further restrict usage of the exclusive lock. That said, we'd probably
have to rework the above code to acquire the shared lock first,
read/check the extent state for the start/end blocks of the I/O and then
cycle out to the exclusive lock in the cases where it is required. It's
not immediately clear to me if we'd still want to set the wait flag in
those unaligned-but-no-zeroing cases. Perhaps somebody who knows the dio
code a bit better can chime in further on all of this..

Thanks. I'll try to convince someone to attempt to do this.

Meanwhile we'll try using -b size=1024. I have hopes that the 
extent-based nature of the filesystem will not cause this to increase 
overhead too much.