Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Wed, 14 Aug 2019 08:33:49 -0700

On Wed, Aug 14, 2019 at 07:54:48PM +1000, Dave Chinner wrote:
> On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> > On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > > +		/*
> > > +		 * Now that we've locked both pages, make sure they still
> > > +		 * represent the data we're interested in.  If not, someone
> > > +		 * is invalidating pages on us and we lose.
> > > +		 */
> > > +		if (src_page->mapping != src->i_mapping ||
> > > +		    src_page->index != srcoff >> PAGE_SHIFT ||
> > > +		    dest_page->mapping != dest->i_mapping ||
> > > +		    dest_page->index != destoff >> PAGE_SHIFT) {
> > > +			same = false;
> > > +			goto unlock;
> > > +		}
> > 
> > It is my understanding that you don't need to check the ->index here.
> > If I'm wrong about that, I'd really appreciate being corrected, because
> > the page cache locking is subtle.
> 
> Ah, when talking to Darrick about this, I didn't notice the code
> took references on the page, so it probably doesn't need the index
> check - the page can't be recycled out from under us here an
> inserted into a new mapping until we drop the reference.
> 
> What I was mainly concerned about here is that we only have a shared
> inode lock on the src inode, so this code can be running
> concurrently with both invalidation and insertion into the mapping.
> e.g. direct write io does invalidation, buffered read does
> insertion. Hence we have to be really careful about the data in the
> source page being valid and stable while we run the comparison.
> 
> And on further thought, I don't think shared locking is actually
> safe here. A shared lock doesn't stop new direct IO from being
> submitted, so inode_dio_wait() just drains IO at that point in time
> and but doesn't provide any guarantee that there isn't concurrent
> DIO running.
> 
> Hence we could do the comparison here, see the data is the same,
> drop the page lock, a DIO write then invalidates the page and writes
> new data while we are comparing the rest of page(s) in the range. By
> the time we've checked the whole range, the data at the start is no
> longer the same, and the comparison is stale.
> 
> And then we do the dedupe operation oblivious to the fact the data
> on disk doesn't actually match anymore, and we corrupt the data in
> the destination file as it gets linked to mismatched data in the
> source file....

<urrrrrrk> Yeah, that looks like a bug to me.  I didn't realize that
directio writes were IOLOCK_SHARED and therefore reflink has to take
IOLOCK_EXCL to block them.

Related question: do we need to do the same for MMAPLOCK?  I think the
answer is yes because xfs_filemap_fault can call page_mkwrite with
MMAPLOCK_SHARED.

--D

> Darrick?
> 
> > You call read_mapping_page() which returns the page with an elevated
> > refcount.  That means the page can't go back to the page allocator and
> > be allocated again.  It can, because it's unlocked, still be truncated,
> > so the check for ->mapping after locking it is needed.  But the check
> > for ->index being correct was done by find_get_entry().
> > 
> > See pagecache_get_page() -- if we specify FGP_LOCK, then it will lock
> > the page, check the ->mapping but not check ->index.  OK, it does check
> > ->index, but in a VM_BUG_ON(), so it's not something that ought to be
> > able to be wrong.
> 
> Yeah, we used to have to play tricks in the old XFS writeback
> clustering code to do our own non-blocking page cache lookups adn
> this was one of the things we needed to be careful about until
> the pagevec_lookup* interfaces came along and solved all the
> problems for us. Funny how the brain remembers old gotchas with
> also reminding you that the problems went away almost as long
> ago.....
> 
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx