[PATCH] xfs: use i_lock to prevent i_size race on dio write completion

Brian Foster <bfoster@xxxxxxxxxx> · Tue, 7 Apr 2015 14:53:31 -0400

XFS allows O_DIRECT writes to the same file to execute in parallel under
the shared iolock. If the write offset is beyond the inode size (not
appending), the write path cycles the exclusive iolock to check for
previously unexposed blocks that must be zeroed. If I/O is synchronous,
this has the side effect of waiting on all in-flight I/O to complete.

If writes occur slightly out of order, however, it's possible for even
O_SYNC|O_DIRECT writes to race to extend i_size in the end_io completion
handler. For example, this can be easily manufactured with an artificial
delay in xfs_end_io_direct_write():

	if (offset + size > i_size_read(inode)) {
		mdelay(...);
		...
	}

Execute the following commands in order, but in parallel such that they
both read the current i_size as 0 and delay to update it:

$ xfs_io -f -d -s -c "pwrite 4k 4k" /mnt/file
$ xfs_io -f -d -s -c "pwrite 0 4k" /mnt/file

Since the write at 4k passes through the delay first, it sets i_size to
8k. Shortly after, the write to 0 sets i_size to 4k:

$ ls -al /mnt/file
-rw-------. 1 root root 4096 Apr  4 06:48 /mnt/file

At this point, any further file extending writes consider the block at
EOF (4k) as stale data that must be zeroed:

$ xfs_io -f -d -s -c "pwrite 8k 4k" /mnt/file
$ ls -al /mnt/file
-rw-------. 1 root root 12288 Apr  4 06:51 /mnt/file
$ hexdump /mnt/file
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001000 0000 0000 0000 0000 0000 0000 0000 0000
*
0002000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0003000

The i_size update in xfs_end_io_direct_write() is a fundamental test/set
race. Grab i_lock around the i_size check and update to protect against
concurrent extenders and ensure completion sees the latest i_size.

Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
---

This is a first stab at a fix for the race described above. While we
might want to do more here (e.g., maybe hold iolock shared over async
dio for explicit truncate protection?), I would like to see this or
something like it as a starting point so we have a backportable fix.
Thoughts?

Brian

 fs/xfs/xfs_aops.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a9b7a1..dfc4e94 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1463,10 +1463,17 @@ xfs_end_io_direct_write(
 	 * While the generic direct I/O code updates the inode size, it does
 	 * so only after the end_io handler is called, which means our
 	 * end_io handler thinks the on-disk size is outside the in-core
-	 * size.  To prevent this just update it a little bit earlier here.
+	 * size. To prevent this just update it a little bit earlier here.
+	 *
+	 * Also, grab i_lock to prevent test/set races between extending I/Os.
+	 * This can occur when a non-appending extending (pos > i_size) write is
+	 * submitted out of offset order from an appending (pos == i_size)
+	 * write.
 	 */
+	spin_lock(&inode->i_lock);
 	if (offset + size > i_size_read(inode))
 		i_size_write(inode, offset + size);
+	spin_unlock(&inode->i_lock);
 
 	/*
 	 * For direct I/O we do not know if we need to allocate blocks or not,
-- 
1.9.3

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs