On Wed, May 29, 2019 at 03:46:29PM +0200, Jan Kara wrote:
> On Wed 29-05-19 14:46:58, Dave Chinner wrote:
> > iomap_apply()
> >
> > 	->iomap_begin()
> > 		map old data extent that we copy from
> >
> > 		allocate new data extent we copy to in data fork,
> > 		immediately replacing old data extent
> >
> > 		return transaction handle as private data

This holds the inode block map locked exclusively across the IO,
so....

> >
> > 	dax_iomap_actor()
> > 		copies data from old extent to new extent
> >
> > 	->iomap_end
> > 		commits transaction now data has been copied, making
> > 		the COW operation atomic with the data copy.
> >
> >
> > This, in fact, should be how we do all DAX writes that require
> > allocation, because then we get rid of the need to zero newly
> > allocated or unwritten extents before we copy the data into it. i.e.
> > we only need to write once to newly allocated storage rather than
> > twice.
>
> You need to be careful though. You need to synchronize with page faults so
> that they cannot see and expose in page tables blocks you've allocated
> before their contents is filled.

... so the page fault will block trying to map the blocks because
it can't get the xfs_inode->i_ilock until the allocation transaction
commits....

> This race was actually the strongest
> motivation for pre-zeroing of blocks. OTOH copy_from_iter() in
> dax_iomap_actor() needs to be able to fault pages to copy from (and these
> pages may be from the same file you're writing to) so you cannot just block
> faulting for the file through I_MMAP_LOCK.

Right, it doesn't take the I_MMAP_LOCK, but it would block further
in. And, really, I don't care all that much about this corner case.
i.e. anyone using a "mmap()+write() zero copy" pattern on DAX within
a file is unbelievably naive - the data still gets copied by the CPU
in the write() call. It's far simpler and more efficient to just
mmap() both ranges of the file(s) and memcpy() in userspace....

FWIW, it's to avoid problems with stupid userspace stuff that nobody
really should be doing that I want range locks for the XFS inode
locks. If userspace overlaps the ranges and deadlocks in that case,
then they get to keep all the broken bits because, IMO, they are
doing something monumentally stupid. I'd probably be making it
return EDEADLOCK back out to userspace in that case rather than
deadlocking but, fundamentally, I think it's broken behaviour that
we should be rejecting with an error rather than adding complexity
trying to handle it.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
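
To make the begin/end shape above concrete, here is a minimal sketch
of how an atomic DAX COW write could hold the allocation transaction
across the data copy. The xfs_dax_cow_alloc/commit/cancel helpers are
hypothetical placeholders, not existing XFS functions; only the
iomap_ops callback signatures and the iomap->private field reflect
the real (v5.2-era) iomap interface.

/* Sketch only, not existing XFS code. iomap_ops signatures per ~v5.2. */
#include <linux/fs.h>
#include <linux/iomap.h>

struct xfs_trans;

/*
 * Hypothetical helpers: map the source extent, allocate the COW
 * destination in the data fork replacing it, and leave the
 * transaction (and the ILOCK it holds) open until commit/cancel.
 */
int xfs_dax_cow_alloc(struct inode *inode, loff_t pos, loff_t length,
		      struct xfs_trans **tpp, struct iomap *iomap);
int xfs_dax_cow_commit(struct xfs_trans *tp);
int xfs_dax_cow_cancel(struct xfs_trans *tp);

static int
xfs_dax_cow_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
			unsigned flags, struct iomap *iomap)
{
	struct xfs_trans	*tp;
	int			error;

	error = xfs_dax_cow_alloc(inode, pos, length, &tp, iomap);
	if (error)
		return error;

	/* Hand the open transaction to ->iomap_end via ->private. */
	iomap->private = tp;
	return 0;
}

static int
xfs_dax_cow_iomap_end(struct inode *inode, loff_t pos, loff_t length,
		      ssize_t written, unsigned flags, struct iomap *iomap)
{
	struct xfs_trans	*tp = iomap->private;

	/*
	 * dax_iomap_actor() has copied the data by the time we get
	 * here, so committing now makes the COW atomic with the copy.
	 * Until this commit, page faults block on the ILOCK and cannot
	 * expose the not-yet-written destination blocks.
	 */
	if (written <= 0)
		return xfs_dax_cow_cancel(tp);
	return xfs_dax_cow_commit(tp);
}

static const struct iomap_ops xfs_dax_cow_iomap_ops = {
	.iomap_begin	= xfs_dax_cow_iomap_begin,
	.iomap_end	= xfs_dax_cow_iomap_end,
};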
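
And a sketch of the userspace alternative mentioned above: map both
ranges of the file MAP_SHARED and memcpy() directly, rather than the
mmap()+write() pattern. Error handling is trimmed, the offsets and
length are assumed page-aligned, and copy_range() is an illustrative
name, not an existing API.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int copy_range(const char *path, off_t src_off, off_t dst_off, size_t len)
{
	int fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	void *src = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, src_off);
	void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, dst_off);
	if (src == MAP_FAILED || dst == MAP_FAILED) {
		close(fd);
		return -1;
	}

	memcpy(dst, src, len);	/* one CPU copy, no write() syscall copy */

	munmap(src, len);
	munmap(dst, len);
	close(fd);
	return 0;
}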