Looks good to me, and tested w/ a testcase I'll send in a bit.

Reviewed-by: Eric Sandeen <sandeen@xxxxxxxxxx>

On 9/28/15 1:38 PM, Brian Foster wrote:
> XFS supports and typically allows concurrent asynchronous direct I/O
> submission to a single file. One exception to the rule is that
> file-extending dio writes that start beyond the current EOF (e.g.,
> potentially create a hole at EOF) require exclusive I/O access to the
> file. This is because such writes must zero any pre-existing blocks
> beyond EOF that are exposed by virtue of now residing within EOF as a
> result of the write about to be submitted.
>
> Before EOF zeroing can occur, the current file i_size must be stabilized
> to avoid data corruption. In this scenario, XFS upgrades the iolock to
> exclude any further I/O submission, waits on in-flight I/O to complete
> to ensure i_size is up to date (i_size is updated on dio write
> completion) and restarts the various checks against the state of the
> file. The problem is that this protection sequence is triggered only
> when the iolock is currently held shared. While this is true for async
> dio in most cases, the caller may upgrade the lock in advance based on
> arbitrary circumstances with respect to EOF zeroing. For example, the
> iolock is always acquired exclusively if the start offset is not block
> aligned. This means that even though the iolock is already held
> exclusive for such I/Os, pending I/O is not drained and thus EOF zeroing
> can occur based on an unstable i_size.
>
> This problem has been reproduced as guest data corruption in virtual
> machines with file-backed qcow2 virtual disks hosted on an XFS
> filesystem. The virtual disks must be configured with aio=native mode
> and must not be truncated out to the maximum file size (as some virt
> managers will do).
>
> Update xfs_file_aio_write_checks() to unconditionally drain in-flight
> dio before EOF zeroing can occur. Rather than trigger the wait based on
> iolock state, use a new flag and upgrade the iolock when necessary. Note
> that this results in a full restart of the inode checks even when the
> iolock was already held exclusive, when technically it is only required
> to recheck i_size. This should be a rare enough occurrence that it is
> preferable to keep the code simple rather than create an alternate
> restart jump target.
>
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_file.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e78feb4..347b3e0 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -574,6 +574,7 @@ xfs_file_aio_write_checks(
>          struct xfs_inode        *ip = XFS_I(inode);
>          ssize_t                 error = 0;
>          size_t                  count = iov_iter_count(from);
> +        bool                    drained_dio = false;
>  
>  restart:
>          error = generic_write_checks(iocb, from);
> @@ -611,12 +612,13 @@ restart:
>                  bool            zero = false;
>  
>                  spin_unlock(&ip->i_flags_lock);
> -                if (*iolock == XFS_IOLOCK_SHARED) {
> -                        xfs_rw_iunlock(ip, *iolock);
> -                        *iolock = XFS_IOLOCK_EXCL;
> -                        xfs_rw_ilock(ip, *iolock);
> -                        iov_iter_reexpand(from, count);
> -
> +                if (!drained_dio) {
> +                        if (*iolock == XFS_IOLOCK_SHARED) {
> +                                xfs_rw_iunlock(ip, *iolock);
> +                                *iolock = XFS_IOLOCK_EXCL;
> +                                xfs_rw_ilock(ip, *iolock);
> +                                iov_iter_reexpand(from, count);
> +                        }
>                          /*
>                           * We now have an IO submission barrier in place, but
>                           * AIO can do EOF updates during IO completion and hence
> @@ -626,6 +628,7 @@ restart:
>                           * no-op.
>                           */
>                          inode_dio_wait(inode);
> +                        drained_dio = true;
>                          goto restart;
>                  }
>                  error = xfs_zero_eof(ip, iocb->ki_pos, i_size_read(inode), &zero);
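
For reference, here is a minimal userspace sketch (C with libaio) of the
I/O pattern the commit message describes. This is not the xfstests case
mentioned above; the file name, sizes, and offsets are illustrative
assumptions, and a single pass is unlikely to land in the race window
(a real reproducer would loop). It only puts the two triggering
conditions together: an extending dio write still in flight, plus a
second write starting beyond EOF at a sector-aligned but
sub-filesystem-block offset, which makes XFS take the iolock exclusive
up front.

/* Build: gcc -o aiodio aiodio.c -laio; run on a 4k-block XFS mount. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR  512             /* dio alignment unit: sector, not fs block */

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb1, cb2, *cbs[2] = { &cb1, &cb2 };
        struct io_event events[2];
        void *buf1, *buf2;
        int fd;

        fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0 || io_setup(2, &ctx) < 0)
                return 1;
        if (posix_memalign(&buf1, SECTOR, SECTOR) ||
            posix_memalign(&buf2, SECTOR, SECTOR))
                return 1;
        memset(buf1, 0xaa, SECTOR);
        memset(buf2, 0xbb, SECTOR);

        /*
         * First write extends the file. Its i_size update only happens
         * at dio completion, so i_size is unstable while it is in flight.
         */
        io_prep_pwrite(&cb1, fd, buf1, SECTOR, 0);

        /*
         * Second write starts beyond EOF at offset 4608: sector aligned
         * but not block aligned on a 4k-block fs, so the iolock is taken
         * exclusive at submission. Pre-patch, that exclusive lock meant
         * inode_dio_wait() was skipped and EOF zeroing could use a stale
         * i_size, zeroing over the range the first write is landing in.
         */
        io_prep_pwrite(&cb2, fd, buf2, SECTOR, 4096 + SECTOR);

        if (io_submit(ctx, 2, cbs) != 2)
                return 1;
        if (io_getevents(ctx, 2, 2, events, NULL) != 2)
                return 1;

        io_destroy(ctx);
        close(fd);
        return 0;
}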