On Fri 27-03-15 08:43:54, Dave Chinner wrote: > On Thu, Mar 26, 2015 at 10:53:02AM +0100, Michal Hocko wrote: > > On Fri 20-03-15 14:48:20, Dave Chinner wrote: > > > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote: > > [...] > > > > Or did I miss your point? Are you concerned about some fs overloading > > > > filemap_fault and do some locking before delegating to filemap_fault? > > > > > > The latter: > > > > > > https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b > > > > Hmm. I am completely unfamiliar with the xfs code but my reading of > > 964aa8d9e4d3..723cac484733 is that the newly introduced lock should be > > OK from the reclaim recursion POV. It protects against truncate and > > punch hole, right? Or are there any internal paths which I am missing > > and would cause problems if we do GFP_FS with XFS_MMAPLOCK_SHARED held? > > It might be OK, but you're only looking at the example I gave you, > not the fundamental issue it demonstrates. That is: filesystems may > have *internal dependencies that are unknown to the page cache or mm > subsystem*. Hence the page cache or mm allocations cannot > arbitrarily ignore allocation constraints the filesystem assigns to > mapping operations.... I fully understand that. I am just trying to understand what are the real requirements from filesystems wrt filemap_fault. mapping gfp mask is not usable much for that because e.g. xfs has GFP_NOFS set for the whole inode life AFAICS. And it seems that this context is not really required even after the recent code changes. We can add gfp_mask into struct vm_fault and initialize it to mapping_gfp_mask | GFP_IOFS and .fault() callback might overwrite it. This would be cleaner than unconditional gfp manipulation (the patch below). But we are in a really hard position if the GFP_NOFS context is really required here. We shouldn't really trigger OOM killer because that could be premature and way too disruptive. We can retry the page fault or the allocation but that both sound suboptimal to me. Do you have any other suggestions? This hasn't been tested yet it just shows the idea mentioned above. --- >From 292cfcbbe18b2afc8d2bc0cf568ca4c5842d4c8f Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@xxxxxxx> Date: Fri, 27 Mar 2015 13:33:51 +0100 Subject: [PATCH] mm: Allow GFP_IOFS for page_cache_read page cache allocation page_cache_read has been historically using page_cache_alloc_cold to allocate a new page. This means that mapping_gfp_mask is used as the base for the gfp_mask. Many filesystems are setting this mask to GFP_NOFS to prevent from fs recursion issues. page_cache_read is, however, not called from the fs layera directly so it doesn't need this protection normally. ceph and ocfs2 which call filemap_fault from their fault handlers seem to be OK because they are not taking any fs lock before invoking generic implementation. xfs which takes XFS_MMAPLOCK_SHARED is safe from the reclaim recursion POV because this lock serializes truncate and punch hole with the page faults and it doesn't get involved in the reclaim. The GFP_NOFS protection might be even harmful. There is a push to fail GFP_NOFS allocations rather than loop within allocator indefinitely with a very limited reclaim ability. Once we start failing those requests the OOM killer might be triggered prematurely because the page cache allocation failure is propagated up the page fault path and end up in pagefault_out_of_memory. We cannot play with mapping_gfp_mask directly because that would be racy wrt. parallel page faults and it might interfere with other users who really rely on NOFS semantic from the stored gfp_mask. The mask is also inode proper so it would even be a layering violation. What we can do instead is to push the gfp_mask into struct vm_fault and allow fs layer to overwrite it should the callback need to be called with a different allocation context. Initialize the default to (mapping_gfp_mask | GFP_IOFS) because this should be safe from the page fault path normally. Why do we care about mapping_gfp_mask at all then? Because this doesn't hold only reclaim protection flags but it also might contain zone and movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have to respect those. Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> Signed-off-by: Michal Hocko <mhocko@xxxxxxx> --- include/linux/mm.h | 4 ++++ mm/filemap.c | 9 ++++----- mm/memory.c | 3 +++ 3 files changed, 11 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index b720b5146a4e..c919e48664d5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -217,10 +217,14 @@ extern pgprot_t protection_map[16]; * ->fault function. The vma's ->fault is responsible for returning a bitmask * of VM_FAULT_xxx flags that give details about how the fault was handled. * + * MM layer fills up gfp_mask for page allocations but fault handler might + * alter it if its implementation requires a different allocation context. + * * pgoff should be used in favour of virtual_address, if possible. */ struct vm_fault { unsigned int flags; /* FAULT_FLAG_xxx flags */ + gfp_t gfp_mask /* gfp mask to be used for allocations */ pgoff_t pgoff; /* Logical page offset based on vma */ void __user *virtual_address; /* Faulting virtual address */ diff --git a/mm/filemap.c b/mm/filemap.c index 4756cba51655..4e620a79ab4d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1746,19 +1746,18 @@ EXPORT_SYMBOL(generic_file_read_iter); * This adds the requested page to the page cache if it isn't already there, * and schedules an I/O to read in its contents from disk. */ -static int page_cache_read(struct file *file, pgoff_t offset) +static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask) { struct address_space *mapping = file->f_mapping; struct page *page; int ret; do { - page = page_cache_alloc_cold(mapping); + page = __page_cache_alloc(gfp_mask|__GFP_COLD); if (!page) return -ENOMEM; - ret = add_to_page_cache_lru(page, mapping, offset, - GFP_KERNEL & mapping_gfp_mask(mapping)); + ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL & gfp_mask); if (ret == 0) ret = mapping->a_ops->readpage(file, page); else if (ret == -EEXIST) @@ -1941,7 +1940,7 @@ no_cached_page: * We're only likely to ever get here if MADV_RANDOM is in * effect. */ - error = page_cache_read(file, offset); + error = page_cache_read(file, offset, vmf.gfp_mask); /* * The page we want has now been added to the page cache. diff --git a/mm/memory.c b/mm/memory.c index c4b08e3ef058..ae350c752d54 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1962,6 +1962,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, vmf.virtual_address = (void __user *)(address & PAGE_MASK); vmf.pgoff = page->index; vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; + vmf.gfp_mask = mapping_gfp_mask(vma->vm_file->f_mapping) | GFP_IOFS; vmf.page = page; ret = vma->vm_ops->page_mkwrite(vma, &vmf); @@ -2705,6 +2706,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address, vmf.virtual_address = (void __user *)(address & PAGE_MASK); vmf.pgoff = pgoff; vmf.flags = flags; + vmf.gfp_mask = mapping_gfp_mask(vma->vm_file->f_mapping) | GFP_IOFS; vmf.page = NULL; ret = vma->vm_ops->fault(vma, &vmf); @@ -2868,6 +2870,7 @@ static void do_fault_around(struct vm_area_struct *vma, unsigned long address, vmf.pgoff = pgoff; vmf.max_pgoff = max_pgoff; vmf.flags = flags; + vmf.gfp_mask = mapping_gfp_mask(vma->vm_file->f_mapping) | GFP_IOFS; vma->vm_ops->map_pages(vma, &vmf); } -- 2.1.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>