On Tue, Nov 15, 2011 at 03:14:12PM -0500, Christoph Hellwig wrote: > Do not use unlogged metadata updates and the VFS dirty bit for updating > the file size after writeback. In addition to causing various problems > with updates getting delayed for far too log this also drags in the > unscalable VFS dirty tracking, and is one of the few remaining unlogged > metadata updates. > > Note that we allocate a new transaction from the I/O completion handler. > While this sounds fairly dangerous it isn't an issue in practice given > that any appending write alreay had to start a transaction in writepages > to allocate blocks, and we'll start throtteling there if we run low on > log space or memory. > > We could still occasionally stall in the completion handler, but given > that we have per-filesystems workqueues for the I/O completions, > and completions that do not have to either convert unwritten extents > or update the file size are processed from interrupt context we do not > have to worry about this stalling a system to death. > > In addition to that implementing log reservations from ->writepage that > are only released by a different thread requires a lot of work, and even > with that wasn't quite doable in a deadlock-free manner. > > Signed-off-by: Christoph Hellwig <hch@xxxxxx> .... > Index: linux-2.6/fs/xfs/xfs_file.c > =================================================================== > --- linux-2.6.orig/fs/xfs/xfs_file.c 2011-11-15 10:03:17.539965975 +0100 > +++ linux-2.6/fs/xfs/xfs_file.c 2011-11-15 18:43:05.059779662 +0100 > @@ -436,6 +436,36 @@ xfs_aio_write_isize_update( > } > } > > +STATIC int > +xfs_aio_write_isize_reset( > + struct xfs_inode *ip) > +{ > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_trans *tp; > + int error = 0; > + > + tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS); > + error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp), 0, 0, 0); > + if (error) { > + xfs_trans_cancel(tp, 0); > + return error; > + } > + > + xfs_ilock(ip, XFS_ILOCK_EXCL); > + > + if (ip->i_d.di_size <= ip->i_size) { > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > + xfs_trans_cancel(tp, 0); > + return 0; > + } > + > + ip->i_d.di_size = ip->i_size; > + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > + > + return xfs_trans_commit(tp, 0); > +} > + > /* > * If this was a direct or synchronous I/O that failed (such as ENOSPC) then > * part of the I/O may have been written to disk before the error occurred. In > @@ -447,14 +477,18 @@ xfs_aio_write_newsize_update( > struct xfs_inode *ip, > xfs_fsize_t new_size) > { > + bool reset = false; > if (new_size == ip->i_new_size) { > xfs_rw_ilock(ip, XFS_ILOCK_EXCL); > if (new_size == ip->i_new_size) > ip->i_new_size = 0; > if (ip->i_d.di_size > ip->i_size) > - ip->i_d.di_size = ip->i_size; > + reset = true; > xfs_rw_iunlock(ip, XFS_ILOCK_EXCL); > } > + > + if (reset) > + xfs_aio_write_isize_reset(ip); I think this reset is racy. ip->i_size can change between the check and the reset, potentially leading to inconsistent values being logged during truncate/write error handling. Further, the reset transaction can fail, and hence we need to propagate the error back to the caller rather than dropping it on the ground. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs