On Mon, Nov 19, 2012 at 11:51:14PM -0800, Darrick J. Wong wrote: > If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get > flushed after the write completion for AIOs. This patch attempts to fix > that problem by marking an I/O as requiring a cache flush in endio > processing, and then issuing the cache flush after any unwritten extent > conversion is done. > > From: Jeff Moyer <jmoyer@xxxxxxxxxx> > Signed-off-by: Jeff Moyer <jmoyer@xxxxxxxxxx> > [darrick.wong@xxxxxxxxxx: Rework patch to use per-mount workqueues] > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > --- > fs/xfs/xfs_aops.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++- > fs/xfs/xfs_aops.h | 1 + > fs/xfs/xfs_mount.h | 1 + > fs/xfs/xfs_super.c | 8 ++++++++ > 4 files changed, 61 insertions(+), 1 deletion(-) > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index e57e2da..9cebbb7 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -173,6 +173,24 @@ xfs_setfilesize( > } > > /* > + * In the case of synchronous, AIO, O_DIRECT writes, we need to flush > + * the disk cache when the I/O is complete. > + */ That's not quite right - we need to fsync the metadata when the data IO is complete. Block device/disk cache flushes are irrelevant at this level as that is wholly encapsulated inside the metadata fsync processing. > +STATIC bool > +xfs_ioend_needs_cache_flush( xfs_ioend_need_fsync() > + struct xfs_ioend *ioend) > +{ > + struct xfs_inode *ip = XFS_I(ioend->io_inode); > + struct xfs_mount *mp = ip->i_mount; > + > + if (!(mp->m_flags & XFS_MOUNT_BARRIER)) > + return false; And regardless of whether we have barriers enabled or not, we need to flush the dirty metadata to the log for O_SYNC/O_DSYNC+AIO+DIO writes here. So there should be no check of the mount flags here. > + return IS_SYNC(ioend->io_inode) || > + (ioend->io_iocb->ki_filp->f_flags & O_DSYNC); > +} > + > +/* > * Schedule IO completion handling on the final put of an ioend. > * > * If there is no work to do we might as well call it a day and free the > @@ -189,11 +207,30 @@ xfs_finish_ioend( > queue_work(mp->m_unwritten_workqueue, &ioend->io_work); > else if (ioend->io_append_trans) > queue_work(mp->m_data_workqueue, &ioend->io_work); > + else if (ioend->io_needs_fsync) > + queue_work(mp->m_aio_blkdev_flush_wq, &ioend->io_work); > else > xfs_destroy_ioend(ioend); > } > } > > +STATIC int > +xfs_ioend_force_cache_flush( STATIC void xfs_ioend_fsync() > + xfs_ioend_t *ioend) > +{ > + struct xfs_inode *ip = XFS_I(ioend->io_inode); > + struct xfs_mount *mp = ip->i_mount; > + int err = 0; > + int datasync; trace_xfs_ioend_fsync(ip); > + datasync = !IS_SYNC(ioend->io_inode) && > + !(ioend->io_iocb->ki_filp->f_flags & __O_SYNC); > + err = do_xfs_file_fsync(ip, mp, datasync); > + xfs_destroy_ioend(ioend); I think this is wrong. The ioend is destroyed by the caller, so putting it here turns all subsequent uses by the caller into use-after-free memory corruption bugs. > + /* do_xfs_file_fsync returns -errno. our caller expects positive. */ > + return -err; This is why xfs_do_file_fsync() should be returning a positive error - to avoid repeated juggling of error signs. > +} > + > /* > * IO write completion. > */ > @@ -250,12 +287,22 @@ xfs_end_io( > error = xfs_setfilesize(ioend); > if (error) > ioend->io_error = -error; > + } else if (ioend->io_needs_fsync) { > + error = xfs_ioend_force_cache_flush(ioend); > + if (error && ioend->io_result > 0) > + ioend->io_error = -error; > + ioend->io_needs_fsync = 0; So this is all use-after-free. Also, there's no need to clear io_needs_fsync() as the ioend is about to be destroyed. > } else { > ASSERT(!xfs_ioend_is_append(ioend)); > } > > done: > - xfs_destroy_ioend(ioend); > + /* the honoring of O_SYNC has to be done last */ > + if (ioend->io_needs_fsync) { > + atomic_inc(&ioend->io_remaining); > + xfs_finish_ioend(ioend); > + } else > + xfs_destroy_ioend(ioend); And requeuing work from one workqueue to the next is something that we can avoid. We know at IO submission time (i.e. xfs_vm_direct_io)) whether an fsync completion is going to be needed during Io completion. The ioend->io_needs_fsync flag can be set then, and the first pass through xfs_finish_ioend() can queue it to the correct workqueue. i.e. it only needs to be queued if it's not already an unwritten or append ioend and it needs an fsync. As it is, all the data completion workqueues run the same completion function so all you need to do is handle the fsync case at the end of the existing processing - it's not an else case. i.e the end of xfs_end_io() becomes: if (ioend->io_needs_fsync) { error = xfs_ioend_fsync(ioend); if (error) ioend->io_error = -error; goto done; } done: xfs_destroy_ioend(ioend); As it is, this code is going to change before these changes go in - there's a nasty regression in the DIO code that I found this afternoon that requires reworking this IO completion logic to avoid. The patch will appear on the list soon.... > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -209,6 +209,7 @@ typedef struct xfs_mount { > struct workqueue_struct *m_data_workqueue; > struct workqueue_struct *m_unwritten_workqueue; > struct workqueue_struct *m_cil_workqueue; > + struct workqueue_struct *m_aio_blkdev_flush_wq; struct workqueue_struct *m_aio_fsync_wq; Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html