Re: [PATCH 5/5] xfs: log file size updates at I/O completion time

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 18 Nov 2011 14:44:11 +1100

On Tue, Nov 15, 2011 at 03:14:12PM -0500, Christoph Hellwig wrote:
> Do not use unlogged metadata updates and the VFS dirty bit for updating
> the file size after writeback.  In addition to causing various problems
> with updates getting delayed for far too log this also drags in the
> unscalable VFS dirty tracking, and is one of the few remaining unlogged
> metadata updates.
> 
> Note that we allocate a new transaction from the I/O completion handler.
> While this sounds fairly dangerous it isn't an issue in practice given
> that any appending write alreay had to start a transaction in writepages
> to allocate blocks, and we'll start throtteling there if we run low on
> log space or memory.
> 
> We could still occasionally stall in the completion handler, but given
> that we have per-filesystems workqueues for the I/O completions,
> and completions that do not have to either convert unwritten extents
> or update the file size are processed from interrupt context we do not
> have to worry about this stalling a system to death.
> 
> In addition to that implementing log reservations from ->writepage that
> are only released by a different thread requires a lot of work, and even
> with that wasn't quite doable in a deadlock-free manner.
> 
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>

....
> Index: linux-2.6/fs/xfs/xfs_file.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/xfs_file.c	2011-11-15 10:03:17.539965975 +0100
> +++ linux-2.6/fs/xfs/xfs_file.c	2011-11-15 18:43:05.059779662 +0100
> @@ -436,6 +436,36 @@ xfs_aio_write_isize_update(
>  	}
>  }
>  
> +STATIC int
> +xfs_aio_write_isize_reset(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_trans	*tp;
> +	int			error = 0;
> +
> +	tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
> +	error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp), 0, 0, 0);
> +	if (error) {
> +		xfs_trans_cancel(tp, 0);
> +		return error;
> +	}
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +
> +	if (ip->i_d.di_size <= ip->i_size) {
> +		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +		xfs_trans_cancel(tp, 0);
> +		return 0;
> +	}
> +
> +	ip->i_d.di_size = ip->i_size;
> +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +
> +	return xfs_trans_commit(tp, 0);
> +}
> +
>  /*
>   * If this was a direct or synchronous I/O that failed (such as ENOSPC) then
>   * part of the I/O may have been written to disk before the error occurred.  In
> @@ -447,14 +477,18 @@ xfs_aio_write_newsize_update(
>  	struct xfs_inode	*ip,
>  	xfs_fsize_t		new_size)
>  {
> +	bool			reset = false;
>  	if (new_size == ip->i_new_size) {
>  		xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
>  		if (new_size == ip->i_new_size)
>  			ip->i_new_size = 0;
>  		if (ip->i_d.di_size > ip->i_size)
> -			ip->i_d.di_size = ip->i_size;
> +			reset = true;
>  		xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
>  	}
> +
> +	if (reset)
> +		xfs_aio_write_isize_reset(ip);

I think this reset is racy. ip->i_size can change between the check
and the reset, potentially leading to inconsistent values being
logged during truncate/write error handling.

Further, the reset transaction can fail, and hence we need to
propagate the error back to the caller rather than dropping it on
the ground.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs