The flush that occurs just before xfs_truncate_page() during a non-extending truncate exists to avoid potential stale data exposure when iomap zeroing races with buffered writes over unwritten extents. However, we've had reports of this causing significant performance regressions on overwrite workloads where the flush serves no correctness purpose.

For example, the uuidd mechanism stores time metadata to a file on every generation sequence. This involves a buffered (over)write followed by a truncate of the file to its current size. If these uuids are used as transaction IDs by a database application, overall performance can suffer tremendously from the repeated flushing on every truncate.

To avoid this problem, update the truncate path to flush only in scenarios that are known to conflict with iomap zeroing. iomap skips zeroing when it sees a hole or unwritten extent, so this essentially means the filesystem should flush if either of those cases has outstanding dirty pagecache and can skip the flush otherwise.

The ideal longer term solution is to avoid the need to flush entirely and allow the zeroing to detect a dirty page and zero it accordingly, but that is more involved and may require changes to the iomap interface. The purpose of this change is therefore to address the performance regression in a manner straightforward enough that it can be separated from further improvements.
Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
---
 fs/xfs/xfs_iops.c | 44 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index d31e64db243f..37f78117557e 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -782,7 +782,15 @@ xfs_truncate_zeroing(
 	xfs_off_t		newsize,
 	bool			*did_zeroing)
 {
+	struct xfs_mount	*mp = ip->i_mount;
+	struct inode		*inode = VFS_I(ip);
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	xfs_off_t		end;
+	xfs_fileoff_t		end_fsb = XFS_B_TO_FSBT(mp, newsize);
 	int			error;
+	bool			found;
 
 	if (newsize > oldsize) {
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
@@ -790,16 +798,40 @@ xfs_truncate_zeroing(
 			did_zeroing);
 	}
 
+	/*
+	 * No zeroing occurs if newsize is block aligned (or zero). The eof page
+	 * is partially zeroed by the pagecache truncate, if necessary, and
+	 * post-eof blocks are removed.
+	 */
+	if ((newsize & (i_blocksize(inode) - 1)) == 0)
+		return 0;
+
 	/*
 	 * iomap won't detect a dirty page over an unwritten block (or a cow
 	 * block over a hole) and subsequently skips zeroing the newly post-EOF
-	 * portion of the page. Flush the new EOF to convert the block before
-	 * the pagecache truncate.
+	 * portion of the page. To ensure proper zeroing occurs, flush the eof
+	 * page if it is dirty and backed by a hole or unwritten extent in the
+	 * data fork. This ensures that iomap sees the eof block in a state that
+	 * warrants zeroing.
+	 *
+	 * This should eventually be handled in iomap processing so we don't
+	 * have to flush at all. We do it here for now to avoid the additional
+	 * latency in cases where it's not absolutely required.
 	 */
-	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, newsize - 1,
-			newsize - 1);
-	if (error)
-		return error;
+	end = newsize - 1;
+	if (filemap_range_needs_writeback(inode->i_mapping, end, end)) {
+		xfs_ilock(ip, XFS_ILOCK_SHARED);
+		found = xfs_iext_lookup_extent(ip, ifp, end_fsb, &icur, &got);
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+
+		if (!found || got.br_startoff > end_fsb ||
+		    got.br_state == XFS_EXT_UNWRITTEN) {
+			error = filemap_write_and_wait_range(inode->i_mapping,
+					end, end);
+			if (error)
+				return error;
+		}
+	}
 
 	return xfs_truncate_page(ip, newsize, did_zeroing);
 }
-- 
2.37.3