I am not really sure it is a good idea to blindly apply this patch to
the older stable tree. Have you checked all filemap_fault handlers?
This might have been quite different in 3.18 than in the kernel I was
developing this against. If the sole purpose of this backport is to
make another patch (abc1be13fd11 ("mm/filemap.c: fix NULL pointer in
page_cache_tree_insert()")) apply more easily, then I've already
suggested how to handle those rejects.

On Mon 23-04-18 01:37:44, Harsh Shandilya wrote:
> From: Michal Hocko <mhocko@xxxxxxxx>
> 
> Commit c20cd45eb01748f0fba77a504f956b000df4ea73 upstream.
> 
> page_cache_read has historically been using page_cache_alloc_cold to
> allocate a new page. This means that mapping_gfp_mask is used as the
> base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to
> prevent fs recursion issues. page_cache_read is called from the
> vm_operations_struct::fault() context during the page fault. This
> context doesn't normally need the reclaim protection.
> 
> ceph and ocfs2, which call filemap_fault from their fault handlers,
> seem to be OK because they do not take any fs lock before invoking
> the generic implementation. xfs, which takes XFS_MMAPLOCK_SHARED, is
> safe from the reclaim recursion POV because this lock serializes
> truncate and punch hole with the page faults and does not get
> involved in the reclaim.
> 
> There is simply no reason to deliberately use a weaker allocation
> context when __GFP_FS | __GFP_IO can be used. The GFP_NOFS
> protection might even be harmful. There is a push to fail GFP_NOFS
> allocations rather than loop within the allocator indefinitely with
> a very limited reclaim ability. Once we start failing those requests
> the OOM killer might be triggered prematurely because the page cache
> allocation failure is propagated up the page fault path and ends up
> in pagefault_out_of_memory.
> 
> We cannot play with mapping_gfp_mask directly because that would be
> racy wrt. parallel page faults and it might interfere with other
> users who really rely on the NOFS semantics of the stored gfp_mask.
> The mask is also inode-proper, so changing it there would even be a
> layering violation. What we can do instead is to push the gfp_mask
> into struct vm_fault and allow the fs layer to overwrite it should
> the callback need to be called with a different allocation context.
> 
> Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
> because this should normally be safe from the page fault path. Why
> do we care about mapping_gfp_mask at all then? Because it does not
> hold only reclaim protection flags; it might also contain zone and
> movability restrictions (GFP_DMA32, __GFP_MOVABLE and others), so we
> have to respect those.
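
For reference, the mechanism this adds is that a fault handler may
override the MM-provided mask before calling into the generic code. A
minimal sketch against the two-argument ->fault() signature used in
this tree (myfs_fault and its NOFS requirement are made up for
illustration, they are not part of the patch):

        static int myfs_fault(struct vm_area_struct *vma,
                              struct vm_fault *vmf)
        {
                /*
                 * Restore this mapping's own (e.g. GFP_NOFS)
                 * allocation context before the generic fault path
                 * allocates page cache pages on our behalf.
                 */
                vmf->gfp_mask = mapping_gfp_mask(vma->vm_file->f_mapping);

                return filemap_fault(vma, vmf);
        }

Handlers which do not care see mapping_gfp_mask() | __GFP_FS |
__GFP_IO, as set up by __get_fault_gfp_mask() in the mm/memory.c hunk
below.
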
> 
> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
> Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Acked-by: Jan Kara <jack@xxxxxxxx>
> Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
> Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> Cc: Mark Fasheh <mfasheh@xxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Harsh Shandilya <harsh@xxxxxxxx>
> ---
>  include/linux/mm.h |  4 ++++
>  mm/filemap.c       |  8 ++++----
>  mm/memory.c        | 17 +++++++++++++++++
>  3 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5adffb0a468f..9ac4697979e8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -203,9 +203,13 @@ extern pgprot_t protection_map[16];
>   *
>   * pgoff should be used in favour of virtual_address, if possible. If pgoff
>   * is used, one may implement ->remap_pages to get nonlinear mapping support.
> + *
> + * MM layer fills up gfp_mask for page allocations but fault handler might
> + * alter it if its implementation requires a different allocation context.
>   */
>  struct vm_fault {
>          unsigned int flags;             /* FAULT_FLAG_xxx flags */
> +        gfp_t gfp_mask;                 /* gfp mask to be used for allocations */
>          pgoff_t pgoff;                  /* Logical page offset based on vma */
>          void __user *virtual_address;   /* Faulting virtual address */
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7e6ab98d4d3c..aafeeefcb00d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1746,18 +1746,18 @@ EXPORT_SYMBOL(generic_file_read_iter);
>   * This adds the requested page to the page cache if it isn't already there,
>   * and schedules an I/O to read in its contents from disk.
>   */
> -static int page_cache_read(struct file *file, pgoff_t offset)
> +static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
>  {
>          struct address_space *mapping = file->f_mapping;
>          struct page *page;
>          int ret;
> 
>          do {
> -                page = page_cache_alloc_cold(mapping);
> +                page = __page_cache_alloc(gfp_mask|__GFP_COLD);
>                  if (!page)
>                          return -ENOMEM;
> 
> -                ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
> +                ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & GFP_KERNEL);
>                  if (ret == 0)
>                          ret = mapping->a_ops->readpage(file, page);
>                  else if (ret == -EEXIST)
> @@ -1940,7 +1940,7 @@ no_cached_page:
>           * We're only likely to ever get here if MADV_RANDOM is in
>           * effect.
>           */
> -        error = page_cache_read(file, offset);
> +        error = page_cache_read(file, offset, vmf->gfp_mask);
> 
>          /*
>           * The page we want has now been added to the page cache.
> diff --git a/mm/memory.c b/mm/memory.c
> index 0c4f5e36b155..5a62c6a42143 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1973,6 +1973,20 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
>          copy_user_highpage(dst, src, va, vma);
>  }
> 
> +static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
> +{
> +        struct file *vm_file = vma->vm_file;
> +
> +        if (vm_file)
> +                return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
> +
> +        /*
> +         * Special mappings (e.g. VDSO) do not have any file so fake
> +         * a default GFP_KERNEL for them.
> +         */
> +        return GFP_KERNEL;
> +}
> +
>  /*
>   * Notify the address space that the page is about to become writable so that
>   * it can prohibit this or wait for the page to get into an appropriate state.
> @@ -1988,6 +2002,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>          vmf.virtual_address = (void __user *)(address & PAGE_MASK);
>          vmf.pgoff = page->index;
>          vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
> +        vmf.gfp_mask = __get_fault_gfp_mask(vma);
>          vmf.page = page;
> 
>          ret = vma->vm_ops->page_mkwrite(vma, &vmf);
> @@ -2670,6 +2685,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>          vmf.pgoff = pgoff;
>          vmf.flags = flags;
>          vmf.page = NULL;
> +        vmf.gfp_mask = __get_fault_gfp_mask(vma);
> 
>          ret = vma->vm_ops->fault(vma, &vmf);
>          if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> @@ -2834,6 +2850,7 @@ static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
>          vmf.pgoff = pgoff;
>          vmf.max_pgoff = max_pgoff;
>          vmf.flags = flags;
> +        vmf.gfp_mask = __get_fault_gfp_mask(vma);
>          vma->vm_ops->map_pages(vma, &vmf);
>  }
> 
> -- 
> 2.15.0.2308.g658a28aa74af

-- 
Michal Hocko
SUSE Labs