On Mon, Jan 11, 2021 at 05:12:12PM +0100, Christoph Hellwig wrote: > We only need the exclusive iolock for direct writes to protect sub-block > zeroing after an allocation or conversion of unwritten extents, and the > synchronous execution of these writes is also only needed because the > iolock is dropped early for the dodgy i_dio_count synchronisation. > > Always start out with the shared iolock in xfs_file_dio_aio_write for > non-appending writes and only upgrade it to exclusive if the start and > end of the write range are not already allocated and in written > state. This means one or two extra lookups in the in-core extent tree, > but with our btree data structure those lookups are very cheap and do > not show up in profiles on NVMe hardware for me. On the other hand > avoiding the lock allows for a high concurrency using aio or io_uring. > > Signed-off-by: Christoph Hellwig <hch@xxxxxx> > Reported-by: Avi Kivity <avi@xxxxxxxxxxxx> > Suggested-by: Brian Foster <bfoster@xxxxxxxxxx> > --- > fs/xfs/xfs_file.c | 127 +++++++++++++++++++++++++++++++++++----------- > 1 file changed, 96 insertions(+), 31 deletions(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 1470fc4f2e0255..59d4c6e90f06c1 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -521,6 +521,57 @@ static const struct iomap_dio_ops xfs_dio_write_ops = { > .end_io = xfs_dio_write_end_io, > }; > > +static int > +xfs_dio_write_exclusive( > + struct kiocb *iocb, > + size_t count, > + bool *exclusive_io) > +{ > + struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp)); > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_ifork *ifp = &ip->i_df; > + loff_t offset = iocb->ki_pos; > + loff_t end = offset + count; > + xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); > + xfs_fileoff_t end_fsb = XFS_B_TO_FSB(mp, end); > + struct xfs_bmbt_irec got = { }; > + struct xfs_iext_cursor icur; > + int ret; > + > + *exclusive_io = true; > + > + /* > + * Bmap information not read in yet or no blocks allocated at all? > + */ > + if (!(ifp->if_flags & XFS_IFEXTENTS) || !ip->i_d.di_nblocks) > + return 0; > + > + ret = xfs_ilock_iocb(iocb, XFS_ILOCK_SHARED); > + if (ret) > + return ret; > + > + if (offset & mp->m_blockmask) { > + if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &icur, &got) || > + got.br_startoff > offset_fsb || > + got.br_state == XFS_EXT_UNWRITTEN) > + goto out_unlock; > + } > + > + if ((end & mp->m_blockmask) && > + got.br_startoff + got.br_blockcount <= end_fsb) { > + if (!xfs_iext_lookup_extent(ip, ifp, end_fsb, &icur, &got) || > + got.br_startoff > end_fsb || > + got.br_state == XFS_EXT_UNWRITTEN) > + goto out_unlock; > + } > + > + *exclusive_io = false; > + > +out_unlock: > + xfs_iunlock(ip, XFS_ILOCK_SHARED); > + return ret; > +} OK, this has a bug in it. We have to do exclusive sub-block IO when the end of the IO is >= i_size_read(inode) because in this situation iomap_dio_rw_actor() will issue an extra bio to zero the portion of the filesystem block that lies beyond EOF. generic/418 hammers this case, and randomly ends up with zeroes where there should be data. Otherwise, it largely makes a similar mess of xfs_file_dio_aio_write() as my initial patchset does. We should just move the unaligned IO out of the main path, regardless of whether it is exclusive or not. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx