XFS allows O_DIRECT writes to the same file to execute in parallel under the shared iolock. If the write offset is beyond the inode size (not appending), the write path cycles the exclusive iolock to check for previously unexposed blocks that must be zeroed. If I/O is synchronous, this has the side effect of waiting on all in-flight I/O to complete. If writes occur slightly out of order, however, it's possible for even O_SYNC|O_DIRECT writes to race to extend i_size in the end_io completion handler. For example, this can be easily manufactured with an artificial delay in xfs_end_io_direct_write(): if (offset + size > i_size_read(inode)) { mdelay(...); ... } Execute the following commands in order, but in parallel such that they both read the current i_size as 0 and delay to update it: $ xfs_io -f -d -s -c "pwrite 4k 4k" /mnt/file $ xfs_io -f -d -s -c "pwrite 0 4k" /mnt/file Since the write at 4k passes through the delay first, it sets i_size to 8k. Shortly after, the write to 0 sets i_size to 4k: $ ls -al /mnt/file -rw-------. 1 root root 4096 Apr 4 06:48 /mnt/file At this point, any further file extending writes consider the block at EOF (4k) as stale data that must be zeroed: $ xfs_io -f -d -s -c "pwrite 8k 4k" /mnt/file $ ls -al /mnt/file -rw-------. 1 root root 12288 Apr 4 06:51 /mnt/file $ hexdump /mnt/file 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001000 0000 0000 0000 0000 0000 0000 0000 0000 * 0002000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0003000 The i_size update in xfs_end_io_direct_write() is a fundamental test/set race. Grab i_lock around the i_size check and update to protect against concurrent extenders and ensure completion sees the latest i_size. Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> --- This is a first stab at a fix for the race described above. While we might want to do more here (e.g., maybe hold iolock shared over async dio for explicit truncate protection?), I would like to see this or something like it as a starting point so we have a backportable fix. Thoughts? Brian fs/xfs/xfs_aops.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 3a9b7a1..dfc4e94 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1463,10 +1463,17 @@ xfs_end_io_direct_write( * While the generic direct I/O code updates the inode size, it does * so only after the end_io handler is called, which means our * end_io handler thinks the on-disk size is outside the in-core - * size. To prevent this just update it a little bit earlier here. + * size. To prevent this just update it a little bit earlier here. + * + * Also, grab i_lock to prevent test/set races between extending I/Os. + * This can occur when a non-appending extending (pos > i_size) write is + * submitted out of offset order from an appending (pos == i_size) + * write. */ + spin_lock(&inode->i_lock); if (offset + size > i_size_read(inode)) i_size_write(inode, offset + size); + spin_unlock(&inode->i_lock); /* * For direct I/O we do not know if we need to allocate blocks or not, -- 1.9.3 _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs