Re: Subtle races between DAX mmap fault and write path

On Wed, Jul 27, 2016 at 03:10:39PM -0600, Ross Zwisler wrote:
> On Wed, Jul 27, 2016 at 02:07:45PM +0200, Jan Kara wrote:
> > Hi,
> > 
> > when testing my latest changes to DAX fault handling code I have hit the
> > following interesting race between the fault and write path (I'll show
> > function names for ext4 but xfs has the same issue AFAICT).

The XFS update I just pushed to Linus contains a rework of the XFS
DAX IO path. It no longer shares the XFS direct IO path, so it
doesn't contain any of the direct IO warts anymore.

> > We have a file 'f' which has a hole at offset 0.
> > 
> > Process 0				Process 1
> > 
> > data = mmap('f');
> > read data[0]
> >   -> fault, we map a hole page
> > 
> > 					pwrite('f', buf, len, 0)
> > 					  -> ext4_file_write_iter
> > 					    inode_lock(inode);
> > 					    __generic_file_write_iter()
> > 					      generic_file_direct_write()
> > 						invalidate_inode_pages2_range()
> > 						  - drops hole page from
> > 						    the radix tree
> > 						ext4_direct_IO()
> > 						  dax_do_io()
> > 						    - allocates block for
> > 						      offset 0
> > data[0] = 1
> >   -> page_mkwrite fault
> >     -> ext4_dax_fault()
> >       down_read(&EXT4_I(inode)->i_mmap_sem);
> >       __dax_fault()
> > 	grab_mapping_entry()
> > 	  - creates locked radix tree entry
> > 	- maps block into PTE
> > 	put_locked_mapping_entry()
> > 
> > 						invalidate_inode_pages2_range()
> > 						  - removes dax entry from
> > 						    the radix tree

In XFS, we don't call __generic_file_write_iter or
generic_file_direct_write(), and xfs_file_dax_write() does not have
this trailing call to invalidate_inode_pages2_range() anymore. It's
DAX - there's nothing to invalidate, right?

So I think Christoph just (accidentally) removed this race condition
from XFS....

> > So we have just lost information that block 0 is mapped and needs flushing
> > caches.
> > 
> > Also the fact that the consistency of data as viewed by mmap and
> > dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> > unexpected to me and we should document it somewhere.

I don't think it does - what, in DAX, is incoherent? If anything,
it's the data in the direct IO buffer, not the view the mmap will
see. i.e. the post-write invalidate is to ensure that applications
that have mmapped the file see the data written by direct IO. That's
not necessary with DAX.

> > The question is how to best fix this. I see three options:
> > 
> > 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> > harsh but should work - we call filemap_write_and_wait() in
> > generic_file_direct_write() so we flush out all caches for the relevant
> > area before dropping radix tree entries.

We don't call filemap_write_and_wait() in xfs_file_dax_write()
anymore, either. It's DAX - we don't need to flush anything to read
the correct data, and there's nothing cached that becomes stale when
we overwrite the destination memory.

[snip]

> > Any opinions on this?
> 
> Can we just skip the two calls to invalidate_inode_pages2_range() in
> generic_file_direct_write() for DAX I/O?

The first invalidate is still there in XFS. The comment above it
in the new XFS code says:

	/*
	 * Yes, even DAX files can have page cache attached to them:  A zeroed
	 * page is inserted into the pagecache when we have to serve a write
	 * fault on a hole.  It should never be dirtied and can simply be
	 * dropped from the pagecache once we get real data for the page.
	 */
	if (mapping->nrpages) {
		ret = invalidate_inode_pages2(mapping);
		WARN_ON_ONCE(ret);
	}


> Similarly, for DAX I don't think we actually need to do the
> filemap_write_and_wait_range() call in generic_file_direct_write() either.
> It's a similar scenario - for direct I/O we are trying to make sure that any
> dirty data in the page cache is written out to media before the ->direct_IO()
> call happens.  For DAX I don't think we care.  If a user does an mmap() write
> which creates a dirty radix tree entry, then does a write(), we should be able
> to happily overwrite the old data with the new without flushing, and just
> leave the dirty radix tree entry in place.

Yup, that's pretty much how XFS treats DAX now - it's not direct IO,
but it's not buffered IO, either...

I don't think there's a problem that needs to be fixed in the DAX
code - the issue is all about the surrounding IO context, i.e.
whether dax_do_io() is automatically coherent with mmap or not
because mmap directly exposes the CPU cache coherent memory
dax_do_io() modifies.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx