On Thu 30-05-19 08:14:45, Dave Chinner wrote: > On Wed, May 29, 2019 at 03:46:29PM +0200, Jan Kara wrote: > > On Wed 29-05-19 14:46:58, Dave Chinner wrote: > > > iomap_apply() > > > > > > ->iomap_begin() > > > map old data extent that we copy from > > > > > > allocate new data extent we copy to in data fork, > > > immediately replacing old data extent > > > > > > return transaction handle as private data > > This holds the inode block map locked exclusively across the IO, > so.... Does it? We do hold XFS_IOLOCK_EXCL during the whole dax write. But xfs_file_iomap_begin() does release XFS_ILOCK_* on exit AFAICS. So I don't see anything that would prevent page fault from mapping blocks into page tables just after xfs_file_iomap_begin() returns. > > > dax_iomap_actor() > > > copies data from old extent to new extent > > > > > > ->iomap_end > > > commits transaction now data has been copied, making > > > the COW operation atomic with the data copy. > > > > > > > > > This, in fact, should be how we do all DAX writes that require > > > allocation, because then we get rid of the need to zero newly > > > allocated or unwritten extents before we copy the data into it. i.e. > > > we only need to write once to newly allocated storage rather than > > > twice. > > > > You need to be careful though. You need to synchronize with page faults so > > that they cannot see and expose in page tables blocks you've allocated > > before their contents is filled. > > ... so the page fault will block trying to map the blocks because > it can't get the xfs_inode->i_ilock until the allocation transaciton > commits.... > > > This race was actually the strongest > > motivation for pre-zeroing of blocks. OTOH copy_from_iter() in > > dax_iomap_actor() needs to be able to fault pages to copy from (and these > > pages may be from the same file you're writing to) so you cannot just block > > faulting for the file through I_MMAP_LOCK. > > Right, it doesn't take the I_MMAP_LOCK, but it would block further > in. And, really, I'm not caring all this much about this corner > case. i.e. anyone using a "mmap()+write() zero copy" pattern on DAX > within a file is unbeleivably naive - the data still gets copied by > the CPU in the write() call. It's far simpler and more effcient to > just mmap() both ranges of the file(s) and memcpy() in userspace.... > > FWIW, it's to avoid problems with stupid userspace stuff that nobody > really should be doing that I want range locks for the XFS inode > locks. If userspace overlaps the ranges and deadlocks in that case, > they they get to keep all the broken bits because, IMO, they are > doing something monumentally stupid. I'd probably be making it > return EDEADLOCK back out to userspace in the case rather than > deadlocking but, fundamentally, I think it's broken behaviour that > we should be rejecting with an error rather than adding complexity > trying to handle it. I agree with this. We must just prevent user from taking the kernel down with maliciously created IOs... Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR