Re: [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 13 May 2021 11:52:52 -0700

On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > +->fallocate implementation must be really careful to maintain page cache
> > > +consistency when punching holes or performing other operations that invalidate
> > > +page cache contents. Usually the filesystem needs to call
> > > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > > +However the filesystem usually also needs to update its internal (and on disk)
> > > +view of file offset -> disk block mapping. Until this update is finished, the
> > > +filesystem needs to block page faults and reads from reloading now-stale page
> > > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > > +and acquires it in shared mode in paths loading pages from disk
> > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > +responsible for taking this lock in its fallocate implementation and generally
> > > +whenever the page cache contents needs to be invalidated because a block is
> > > +moving from under a page.
> > > +
> > > +->copy_file_range and ->remap_file_range implementations need to serialize
> > > +against modifications of file data while the operation is running. For blocking
> > > +changes through write(2) and similar operations inode->i_rwsem can be used. For
> > > +blocking changes through memory mapping, the filesystem can use
> > > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > > +implementation.
> > 
> > Question: What is the locking order when acquiring the invalidate_lock
> > of two different files?  Is it the same as i_rwsem (increasing order of
> > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > being hoisted here (increasing order of i_ino)?
> > 
> > The reason I ask is that remap_file_range has to do that, but I don't
> > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > calls in xfs_ilock2_io_mmap in this series.
> 
> Good question. Technically, I don't think there's real need to establish a
> single ordering because locks among different filesystems are never going
> to be acquired together (effectively each lock type is local per sb and we
> are free to define an ordering for each lock type differently). But to
> maintain some sanity I guess having the same locking order for doublelock
> of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> by-ino ordering? So that we don't have to consider two different orders in
> xfs_lock_two_inodes()...

I imagine Dave will chime in on this, but I suspect the reason is
hysterical raisins^Wreasons.  It might simply be time to convert all
three XFS inode locks to use the same ordering rules.

--D

> 
> 								Honza
> 
> > > +
> > >  dquot_operations
> > >  ================
> > >  
> > > @@ -634,9 +658,9 @@ access:		yes
> > >  to be faulted in. The filesystem must find and return the page associated
> > >  with the passed in "pgoff" in the vm_fault structure. If it is possible that
> > >  the page may be truncated and/or invalidated, then the filesystem must lock
> > > -the page, then ensure it is not already truncated (the page lock will block
> > > -subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
> > > -locked. The VM will unlock the page.
> > > +invalidate_lock, then ensure the page is not already truncated (invalidate_lock
> > > +will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the
> > > +page locked. The VM will unlock the page.
> > >  
> > >  ->map_pages() is called when VM asks to map easy accessible pages.
> > >  Filesystem should find and map pages associated with offsets from "start_pgoff"
> > > @@ -647,12 +671,14 @@ page table entry. Pointer to entry associated with the page is passed in
> > >  "pte" field in vm_fault structure. Pointers to entries for other offsets
> > >  should be calculated relative to "pte".
> > >  
> > > -->page_mkwrite() is called when a previously read-only pte is
> > > -about to become writeable. The filesystem again must ensure that there are
> > > -no truncate/invalidate races, and then return with the page locked. If
> > > -the page has been truncated, the filesystem should not look up a new page
> > > -like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
> > > -will cause the VM to retry the fault.
> > > +->page_mkwrite() is called when a previously read-only pte is about to become
> > > +writeable. The filesystem again must ensure that there are no
> > > +truncate/invalidate races or races with operations such as ->remap_file_range
> > > +or ->copy_file_range, and then return with the page locked. Usually
> > > +mapping->invalidate_lock is suitable for proper serialization. If the page has
> > > +been truncated, the filesystem should not look up a new page like the ->fault()
> > > +handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
> > > +retry the fault.
> > >  
> > >  ->pfn_mkwrite() is the same as page_mkwrite but when the pte is
> > >  VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index c93500d84264..63a814367118 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -190,6 +190,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> > >  	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> > >  	mapping->private_data = NULL;
> > >  	mapping->writeback_index = 0;
> > > +	init_rwsem(&mapping->invalidate_lock);
> > > +	lockdep_set_class(&mapping->invalidate_lock,
> > > +			  &sb->s_type->invalidate_lock_key);
> > >  	inode->i_private = NULL;
> > >  	inode->i_mapping = mapping;
> > >  	INIT_HLIST_HEAD(&inode->i_dentry);	/* buggered by rcu freeing */
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index c3c88fdb9b2a..897238d9f1e0 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -436,6 +436,10 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
> > >   * struct address_space - Contents of a cacheable, mappable object.
> > >   * @host: Owner, either the inode or the block_device.
> > >   * @i_pages: Cached pages.
> > > + * @invalidate_lock: Guards coherency between page cache contents and
> > > + *   file offset->disk block mappings in the filesystem during invalidates.
> > > + *   It is also used to block modification of page cache contents through
> > > + *   memory mappings.
> > >   * @gfp_mask: Memory allocation flags to use for allocating pages.
> > >   * @i_mmap_writable: Number of VM_SHARED mappings.
> > >   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> > > @@ -453,6 +457,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
> > >  struct address_space {
> > >  	struct inode		*host;
> > >  	struct xarray		i_pages;
> > > +	struct rw_semaphore	invalidate_lock;
> > >  	gfp_t			gfp_mask;
> > >  	atomic_t		i_mmap_writable;
> > >  #ifdef CONFIG_READ_ONLY_THP_FOR_FS
> > > @@ -2488,6 +2493,7 @@ struct file_system_type {
> > >  
> > >  	struct lock_class_key i_lock_key;
> > >  	struct lock_class_key i_mutex_key;
> > > +	struct lock_class_key invalidate_lock_key;
> > >  	struct lock_class_key i_mutex_dir_key;
> > >  };
> > >  
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index ba1068a1837f..4d9ec4c6cc34 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -77,7 +77,8 @@
> > >   *        ->i_pages lock
> > >   *
> > >   *  ->i_rwsem
> > > - *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> > > + *    ->invalidate_lock		(acquired by fs in truncate path)
> > > + *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> > >   *
> > >   *  ->mmap_lock
> > >   *    ->i_mmap_rwsem
> > > @@ -85,7 +86,8 @@
> > >   *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
> > >   *
> > >   *  ->mmap_lock
> > > - *    ->lock_page		(access_process_vm)
> > > + *    ->invalidate_lock		(filemap_fault)
> > > + *      ->lock_page		(filemap_fault, access_process_vm)
> > >   *
> > >   *  ->i_rwsem			(generic_perform_write)
> > >   *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
> > > @@ -2368,20 +2370,30 @@ static int filemap_update_page(struct kiocb *iocb,
> > >  {
> > >  	int error;
> > >  
> > > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > > +		if (!down_read_trylock(&mapping->invalidate_lock))
> > > +			return -EAGAIN;
> > > +	} else {
> > > +		down_read(&mapping->invalidate_lock);
> > > +	}
> > > +
> > >  	if (!trylock_page(page)) {
> > > +		error = -EAGAIN;
> > >  		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
> > > -			return -EAGAIN;
> > > +			goto unlock_mapping;
> > >  		if (!(iocb->ki_flags & IOCB_WAITQ)) {
> > > +			up_read(&mapping->invalidate_lock);
> > >  			put_and_wait_on_page_locked(page, TASK_KILLABLE);
> > >  			return AOP_TRUNCATED_PAGE;
> > >  		}
> > >  		error = __lock_page_async(page, iocb->ki_waitq);
> > >  		if (error)
> > > -			return error;
> > > +			goto unlock_mapping;
> > >  	}
> > >  
> > > +	error = AOP_TRUNCATED_PAGE;
> > >  	if (!page->mapping)
> > > -		goto truncated;
> > > +		goto unlock;
> > >  
> > >  	error = 0;
> > >  	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page))
> > > @@ -2392,15 +2404,13 @@ static int filemap_update_page(struct kiocb *iocb,
> > >  		goto unlock;
> > >  
> > >  	error = filemap_read_page(iocb->ki_filp, mapping, page);
> > > -	if (error == AOP_TRUNCATED_PAGE)
> > > -		put_page(page);
> > > -	return error;
> > > -truncated:
> > > -	unlock_page(page);
> > > -	put_page(page);
> > > -	return AOP_TRUNCATED_PAGE;
> > > +	goto unlock_mapping;
> > >  unlock:
> > >  	unlock_page(page);
> > > +unlock_mapping:
> > > +	up_read(&mapping->invalidate_lock);
> > > +	if (error == AOP_TRUNCATED_PAGE)
> > > +		put_page(page);
> > >  	return error;
> > >  }
> > >  
> > > @@ -2415,6 +2425,19 @@ static int filemap_create_page(struct file *file,
> > >  	if (!page)
> > >  		return -ENOMEM;
> > >  
> > > +	/*
> > > +	 * Protect against truncate / hole punch. Grabbing invalidate_lock here
> > > +	 * assures we cannot instantiate and bring uptodate new pagecache pages
> > > +	 * after evicting page cache during truncate and before actually
> > > +	 * freeing blocks.  Note that we could release invalidate_lock after
> > > +	 * inserting the page into page cache as the locked page would then be
> > > +	 * enough to synchronize with hole punching. But there are code paths
> > > +	 * such as filemap_update_page() filling in partially uptodate pages or
> > > +	 * ->readpages() that need to hold invalidate_lock while mapping blocks
> > > +	 * for IO so let's hold the lock here as well to keep locking rules
> > > +	 * simple.
> > > +	 */
> > > +	down_read(&mapping->invalidate_lock);
> > >  	error = add_to_page_cache_lru(page, mapping, index,
> > >  			mapping_gfp_constraint(mapping, GFP_KERNEL));
> > >  	if (error == -EEXIST)
> > > @@ -2426,9 +2449,11 @@ static int filemap_create_page(struct file *file,
> > >  	if (error)
> > >  		goto error;
> > >  
> > > +	up_read(&mapping->invalidate_lock);
> > >  	pagevec_add(pvec, page);
> > >  	return 0;
> > >  error:
> > > +	up_read(&mapping->invalidate_lock);
> > >  	put_page(page);
> > >  	return error;
> > >  }
> > > @@ -2988,6 +3013,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
> > >  		ret = VM_FAULT_MAJOR;
> > >  		fpin = do_sync_mmap_readahead(vmf);
> > > +	}
> > > +
> > > +	/*
> > > +	 * See comment in filemap_create_page() why we need invalidate_lock
> > > +	 */
> > > +	down_read(&mapping->invalidate_lock);
> > > +	if (!page) {
> > >  retry_find:
> > >  		page = pagecache_get_page(mapping, offset,
> > >  					  FGP_CREAT|FGP_FOR_MMAP,
> > > @@ -2995,6 +3027,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  		if (!page) {
> > >  			if (fpin)
> > >  				goto out_retry;
> > > +			up_read(&mapping->invalidate_lock);
> > >  			return VM_FAULT_OOM;
> > >  		}
> > >  	}
> > > @@ -3035,9 +3068,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  	if (unlikely(offset >= max_off)) {
> > >  		unlock_page(page);
> > >  		put_page(page);
> > > +		up_read(&mapping->invalidate_lock);
> > >  		return VM_FAULT_SIGBUS;
> > >  	}
> > >  
> > > +	up_read(&mapping->invalidate_lock);
> > >  	vmf->page = page;
> > >  	return ret | VM_FAULT_LOCKED;
> > >  
> > > @@ -3056,6 +3091,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  
> > >  	if (!error || error == AOP_TRUNCATED_PAGE)
> > >  		goto retry_find;
> > > +	up_read(&mapping->invalidate_lock);
> > >  
> > >  	return VM_FAULT_SIGBUS;
> > >  
> > > @@ -3067,6 +3103,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  	 */
> > >  	if (page)
> > >  		put_page(page);
> > > +	up_read(&mapping->invalidate_lock);
> > >  	if (fpin)
> > >  		fput(fpin);
> > >  	return ret | VM_FAULT_RETRY;
> > > @@ -3437,6 +3474,8 @@ static struct page *do_read_cache_page(struct address_space *mapping,
> > >   *
> > >   * If the page does not get brought uptodate, return -EIO.
> > >   *
> > > + * The function expects mapping->invalidate_lock to be already held.
> > > + *
> > >   * Return: up to date page on success, ERR_PTR() on failure.
> > >   */
> > >  struct page *read_cache_page(struct address_space *mapping,
> > > @@ -3460,6 +3499,8 @@ EXPORT_SYMBOL(read_cache_page);
> > >   *
> > >   * If the page does not get brought uptodate, return -EIO.
> > >   *
> > > + * The function expects mapping->invalidate_lock to be already held.
> > > + *
> > >   * Return: up to date page on success, ERR_PTR() on failure.
> > >   */
> > >  struct page *read_cache_page_gfp(struct address_space *mapping,
> > > diff --git a/mm/readahead.c b/mm/readahead.c
> > > index d589f147f4c2..9785c54107bb 100644
> > > --- a/mm/readahead.c
> > > +++ b/mm/readahead.c
> > > @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> > >  	 */
> > >  	unsigned int nofs = memalloc_nofs_save();
> > >  
> > > +	down_read(&mapping->invalidate_lock);
> > >  	/*
> > >  	 * Preallocate as many pages as we will need.
> > >  	 */
> > > @@ -236,6 +237,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> > >  	 * will then handle the error.
> > >  	 */
> > >  	read_pages(ractl, &page_pool, false);
> > > +	up_read(&mapping->invalidate_lock);
> > >  	memalloc_nofs_restore(nofs);
> > >  }
> > >  EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index a35cbbbded0d..76d33c3b8ae6 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -22,24 +22,25 @@
> > >   *
> > >   * inode->i_rwsem	(while writing or truncating, not reading or faulting)
> > >   *   mm->mmap_lock
> > > - *     page->flags PG_locked (lock_page)   * (see hugetlbfs below)
> > > - *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
> > > - *         mapping->i_mmap_rwsem
> > > - *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> > > - *           anon_vma->rwsem
> > > - *             mm->page_table_lock or pte_lock
> > > - *               swap_lock (in swap_duplicate, swap_info_get)
> > > - *                 mmlist_lock (in mmput, drain_mmlist and others)
> > > - *                 mapping->private_lock (in __set_page_dirty_buffers)
> > > - *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
> > > - *                     i_pages lock (widely used)
> > > - *                       lruvec->lru_lock (in lock_page_lruvec_irq)
> > > - *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > > - *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > > - *                   sb_lock (within inode_lock in fs/fs-writeback.c)
> > > - *                   i_pages lock (widely used, in set_page_dirty,
> > > - *                             in arch-dependent flush_dcache_mmap_lock,
> > > - *                             within bdi.wb->list_lock in __sync_single_inode)
> > > + *     mapping->invalidate_lock (in filemap_fault)
> > > + *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
> > > + *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
> > > + *           mapping->i_mmap_rwsem
> > > + *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> > > + *             anon_vma->rwsem
> > > + *               mm->page_table_lock or pte_lock
> > > + *                 swap_lock (in swap_duplicate, swap_info_get)
> > > + *                   mmlist_lock (in mmput, drain_mmlist and others)
> > > + *                   mapping->private_lock (in __set_page_dirty_buffers)
> > > + *                     lock_page_memcg move_lock (in __set_page_dirty_buffers)
> > > + *                       i_pages lock (widely used)
> > > + *                         lruvec->lru_lock (in lock_page_lruvec_irq)
> > > + *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > > + *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > > + *                     sb_lock (within inode_lock in fs/fs-writeback.c)
> > > + *                     i_pages lock (widely used, in set_page_dirty,
> > > + *                               in arch-dependent flush_dcache_mmap_lock,
> > > + *                               within bdi.wb->list_lock in __sync_single_inode)
> > >   *
> > >   * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
> > >   *   ->tasklist_lock
> > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > index 57a618c4a0d6..93bde2741e0e 100644
> > > --- a/mm/truncate.c
> > > +++ b/mm/truncate.c
> > > @@ -415,7 +415,7 @@ EXPORT_SYMBOL(truncate_inode_pages_range);
> > >   * @mapping: mapping to truncate
> > >   * @lstart: offset from which to truncate
> > >   *
> > > - * Called under (and serialised by) inode->i_rwsem.
> > > + * Called under (and serialised by) inode->i_rwsem and inode->i_mapping_rwsem.
> > >   *
> > >   * Note: When this function returns, there can be a page in the process of
> > >   * deletion (inside __delete_from_page_cache()) in the specified range.  Thus
> > > -- 
> > > 2.26.2
> > > 
> -- 
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR