On Thu, Aug 15, 2019 at 07:28:33AM +1000, Dave Chinner wrote: > On Wed, Aug 14, 2019 at 08:33:49AM -0700, Darrick J. Wong wrote: > > On Wed, Aug 14, 2019 at 07:54:48PM +1000, Dave Chinner wrote: > > > On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote: > > > > On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote: > > > > > + /* > > > > > + * Now that we've locked both pages, make sure they still > > > > > + * represent the data we're interested in. If not, someone > > > > > + * is invalidating pages on us and we lose. > > > > > + */ > > > > > + if (src_page->mapping != src->i_mapping || > > > > > + src_page->index != srcoff >> PAGE_SHIFT || > > > > > + dest_page->mapping != dest->i_mapping || > > > > > + dest_page->index != destoff >> PAGE_SHIFT) { > > > > > + same = false; > > > > > + goto unlock; > > > > > + } > > > > > > > > It is my understanding that you don't need to check the ->index here. > > > > If I'm wrong about that, I'd really appreciate being corrected, because > > > > the page cache locking is subtle. > > > > > > Ah, when talking to Darrick about this, I didn't notice the code > > > took references on the page, so it probably doesn't need the index > > > check - the page can't be recycled out from under us here an > > > inserted into a new mapping until we drop the reference. > > > > > > What I was mainly concerned about here is that we only have a shared > > > inode lock on the src inode, so this code can be running > > > concurrently with both invalidation and insertion into the mapping. > > > e.g. direct write io does invalidation, buffered read does > > > insertion. Hence we have to be really careful about the data in the > > > source page being valid and stable while we run the comparison. > > > > > > And on further thought, I don't think shared locking is actually > > > safe here. A shared lock doesn't stop new direct IO from being > > > submitted, so inode_dio_wait() just drains IO at that point in time > > > and but doesn't provide any guarantee that there isn't concurrent > > > DIO running. > > > > > > Hence we could do the comparison here, see the data is the same, > > > drop the page lock, a DIO write then invalidates the page and writes > > > new data while we are comparing the rest of page(s) in the range. By > > > the time we've checked the whole range, the data at the start is no > > > longer the same, and the comparison is stale. > > > > > > And then we do the dedupe operation oblivious to the fact the data > > > on disk doesn't actually match anymore, and we corrupt the data in > > > the destination file as it gets linked to mismatched data in the > > > source file.... > > > > <urrrrrrk> Yeah, that looks like a bug to me. I didn't realize that > > directio writes were IOLOCK_SHARED and therefore reflink has to take > > IOLOCK_EXCL to block them. > > > > Related question: do we need to do the same for MMAPLOCK? I think the > > answer is yes because xfs_filemap_fault can call page_mkwrite with > > MMAPLOCK_SHARED. > > Hmmmm. Yes, you are right, but I don't just holding MMAPLOCK_EXCL is > enough. Holding the MMAPLOCK will hold off page faults while we have > the lock, but it won't prevent pages that already have writeable > ptes from being modified as they don't require another page fault > until they've been cleaned. > > So it seems to me that if we need to ensure that the file range is > not being concurrently modified, we have to: > > a) inode lock exclusive > b) mmap lock exclusive > c) break layouts(*) > d) wait for dio > e) clean all the dirty pages > > On both the source and destination files. And then, because the > locks we hold form a barrier against newly dirtied pages, will all > attempts to modify the data be blocked. And so now the data > comparison can be done safely. I think xfs already proceeds in this order (a-e), it's just that we aren't correctly taking IOLOCK_EXCL/MMAPLOCK_EXCL on the source file to prevent some other thread from starting a directio write or dirtying an mmap'd page. But let's try my crappy little patch that fixes the shared locks and see what other smoke comes out of the machine... --D > (*) The break layouts bit is necessary to handle co-ordination with > RDMA, NVMEoF, P2P DMA, pNFS, etc that manipulate data directly via > the block device rather than through file pages or DIO... > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx