The pre-size update flush logic in xfs_setattr_size() has become warped enough that it's difficult to surmise how it is actually supposed to work. The original purpose of this flush is to mitigate the "NULL file" problem when a truncate logs a size update before preceding writes have been flushed and the fs happens to crash. This code has seen several incremental changes since then that alter behavior in potentially unexpected ways.

For context, the first in line is commit 5885ebda878b ("xfs: ensure truncate forces zeroed blocks to disk"). This introduced the zeroing check specifically for extending truncates and seems a straightforward enough change to accomplish that correctly.

Next, commit 68a9f5e7007c ("xfs: implement iomap based buffered write path") switched over to iomap and introduced did_zeroing for the truncate down case. This didn't change the flush range offsets, however, which means partial post-eof zeroing on a truncate down may only flush if the on-disk inode size hadn't been updated to reflect the in-core size at the start of the truncate.

Sometime after that, commit 350976ae2187 ("xfs: truncate pagecache before writeback in xfs_setattr_size()") reordered the flush to after the pagecache truncate to prevent a stale data exposure due to iomap not zeroing properly. This failed to account for the fact that pagecache truncate doesn't mark pages dirty and thus leaves the filesystem responsible for on-disk consistency, so post-eof data exposure was still possible if a dirty page was cleaned before the pagecache truncate. It also introduced an off-by-one issue for the newsize == oldsize scenario, which causes the flush to submit the EOF page for I/O but not actually wait on it if the offsets align to the same page.

Finally, commit 869ae85dae64 ("xfs: flush new eof page on truncate to avoid post-eof corruption") came along to address the aforementioned stale data exposure race.
This fails to account for the same scenario on extending truncates, for one, but can also work against the NULL file detection logic the flush was introduced to mitigate in the first place. This is because selectively flushing the EOF block can update the on-disk size before any preceding dirty data has been written back.

Since it is confusing enough to assess the intent of the current code and the various ways it might or might not be broken, this patch simply assumes we want to flush any combination of block zeroing or previous I/O patterns deemed susceptible to the NULL file problem, and then tries to do that correctly. Note that the EOF block flush cannot be removed without reintroducing the data exposure race, but that problem is mitigated by a separate patch that moves the flush out of truncate and into the iomap processing callbacks such that it is no longer unconditional.

Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
---
 fs/xfs/xfs_iops.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2e10e1c66ad6..39e9088cd32c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -785,6 +785,7 @@ xfs_setattr_size(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
 	xfs_off_t		oldsize, newsize;
+	loff_t			flushoff = LLONG_MAX;
 	struct xfs_trans	*tp;
 	int			error;
 	uint			lock_flags = 0;
@@ -880,17 +881,24 @@ xfs_setattr_size(
 	truncate_setsize(inode, newsize);
 
 	/*
-	 * We are going to log the inode size change in this transaction so
-	 * any previous writes that are beyond the on disk EOF and the new
-	 * EOF that have not been written out need to be written here. If we
-	 * do not write the data out, we expose ourselves to the null files
-	 * problem. Note that this includes any block zeroing we did above;
-	 * otherwise those blocks may not be zeroed after a crash.
+	 * We are going to log the inode size change in this transaction so any
+	 * previous writes that are beyond the on disk EOF and the new EOF that
+	 * have not been written out need to be written here. If we do not write
+	 * the data out, we expose ourselves to the null files problem.
+	 *
+	 * To also cover block zeroing performed above, start with the lowest of
+	 * the old and new sizes to handle truncate up or down, and then factor
+	 * in ->i_disk_size if necessary. Since pagecache was truncated just
+	 * above, we don't need a precise end offset and can flush through the
+	 * end of the mapping.
 	 */
-	if (did_zeroing ||
-	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
-		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
-						ip->i_disk_size, newsize - 1);
+	if (did_zeroing)
+		flushoff = min_t(loff_t, oldsize, newsize);
+	if (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)
+		flushoff = min_t(loff_t, flushoff, ip->i_disk_size);
+	if (flushoff != LLONG_MAX) {
+		error = filemap_write_and_wait_range(inode->i_mapping, flushoff,
+						     LLONG_MAX);
 		if (error)
 			return error;
 	}
-- 
2.37.3