On 23 April 2018 4:16:36 AM IST, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>I am not really sure it is a good idea to blindly apply this patch to
>the older stable tree. Have you checked all filemap_fault handlers?
>This might have been quite different in 3.18 than for the kernel I was
>developing this against.

I did, but it's likely that I missed a few instances.

>If the sole purpose of this backport is to make the other patch
>(abc1be13fd11 ("mm/filemap.c: fix NULL pointer in
>page_cache_tree_insert()")) apply more easily, then I've already
>suggested how to handle those rejects.

I'll look for the email after this exam and spin up a fixed series.
Thanks for the heads-up!

>On Mon 23-04-18 01:37:44, Harsh Shandilya wrote:
>> From: Michal Hocko <mhocko@xxxxxxxx>
>>
>> Commit c20cd45eb01748f0fba77a504f956b000df4ea73 upstream.
>>
>> page_cache_read has historically used page_cache_alloc_cold to
>> allocate a new page. This means that mapping_gfp_mask is used as the
>> base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to
>> prevent fs recursion issues. page_cache_read is called from the
>> vm_operations_struct::fault() context during the page fault. This
>> context doesn't normally need the reclaim protection.
>>
>> ceph and ocfs2, which call filemap_fault from their fault handlers,
>> seem to be OK because they do not take any fs lock before invoking
>> the generic implementation. xfs, which takes XFS_MMAPLOCK_SHARED, is
>> safe from the reclaim recursion POV because this lock serializes
>> truncate and punch hole with the page faults and it doesn't get
>> involved in the reclaim.
>>
>> There is simply no reason to deliberately use a weaker allocation
>> context when __GFP_FS | __GFP_IO can be used. The GFP_NOFS
>> protection might even be harmful. There is a push to fail GFP_NOFS
>> allocations rather than loop within the allocator indefinitely with
>> a very limited reclaim ability. Once we start failing those requests
>> the OOM killer might be triggered prematurely because the page cache
>> allocation failure is propagated up the page fault path and ends up
>> in pagefault_out_of_memory.
>>
>> We cannot play with mapping_gfp_mask directly because that would be
>> racy wrt. parallel page faults and it might interfere with other
>> users who really rely on the NOFS semantics from the stored
>> gfp_mask. The mask is also inode proper so it would even be a
>> layering violation. What we can do instead is to push the gfp_mask
>> into struct vm_fault and allow the fs layer to overwrite it should
>> the callback need to be called with a different allocation context.
>>
>> Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
>> because this should normally be safe from the page fault path. Why
>> do we care about mapping_gfp_mask at all then? Because it doesn't
>> hold only reclaim protection flags; it might also contain zone and
>> movability restrictions (GFP_DMA32, __GFP_MOVABLE and others), so we
>> have to respect those.
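
To make the effect of the patch easier to see at a glance: once it
lands, a ->fault() handler receives a pre-filled vmf->gfp_mask and may
narrow it before calling into the generic code. A minimal sketch
against the 3.18-era fault signature (examplefs_fault is a made-up
handler used purely for illustration; filemap_fault is the real
generic implementation):

static int examplefs_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	/*
	 * The MM layer initializes vmf->gfp_mask to
	 * mapping_gfp_mask() | __GFP_FS | __GFP_IO. A filesystem whose
	 * fault path holds fs locks could clear __GFP_FS again so that
	 * reclaim cannot recurse into the filesystem.
	 */
	vmf->gfp_mask &= ~__GFP_FS;

	return filemap_fault(vma, vmf);
}

Filesystems that are fine with a full reclaim context need no change
at all, since the MM layer already fills in the safe default.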
>>
>> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
>> Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
>> Acked-by: Jan Kara <jack@xxxxxxxx>
>> Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
>> Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
>> Cc: Mel Gorman <mgorman@xxxxxxx>
>> Cc: Dave Chinner <david@xxxxxxxxxxxxx>
>> Cc: Mark Fasheh <mfasheh@xxxxxxxx>
>> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Harsh Shandilya <harsh@xxxxxxxx>
>> ---
>>  include/linux/mm.h |  4 ++++
>>  mm/filemap.c       |  8 ++++----
>>  mm/memory.c        | 17 +++++++++++++++++
>>  3 files changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5adffb0a468f..9ac4697979e8 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -203,9 +203,13 @@ extern pgprot_t protection_map[16];
>>   *
>>   * pgoff should be used in favour of virtual_address, if possible. If pgoff
>>   * is used, one may implement ->remap_pages to get nonlinear mapping support.
>> + *
>> + * MM layer fills up gfp_mask for page allocations but fault handler might
>> + * alter it if its implementation requires a different allocation context.
>>   */
>>  struct vm_fault {
>>  	unsigned int flags;		/* FAULT_FLAG_xxx flags */
>> +	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
>>  	pgoff_t pgoff;			/* Logical page offset based on vma */
>>  	void __user *virtual_address;	/* Faulting virtual address */
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7e6ab98d4d3c..aafeeefcb00d 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1746,18 +1746,18 @@ EXPORT_SYMBOL(generic_file_read_iter);
>>   * This adds the requested page to the page cache if it isn't already there,
>>   * and schedules an I/O to read in its contents from disk.
>>   */
>> -static int page_cache_read(struct file *file, pgoff_t offset)
>> +static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
>>  {
>>  	struct address_space *mapping = file->f_mapping;
>>  	struct page *page;
>>  	int ret;
>>
>>  	do {
>> -		page = page_cache_alloc_cold(mapping);
>> +		page = __page_cache_alloc(gfp_mask|__GFP_COLD);
>>  		if (!page)
>>  			return -ENOMEM;
>>
>> -		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
>> +		ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & GFP_KERNEL);
>>  		if (ret == 0)
>>  			ret = mapping->a_ops->readpage(file, page);
>>  		else if (ret == -EEXIST)
>> @@ -1940,7 +1940,7 @@ no_cached_page:
>>  	 * We're only likely to ever get here if MADV_RANDOM is in
>>  	 * effect.
>>  	 */
>> -	error = page_cache_read(file, offset);
>> +	error = page_cache_read(file, offset, vmf->gfp_mask);
>>
>>  	/*
>>  	 * The page we want has now been added to the page cache.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 0c4f5e36b155..5a62c6a42143 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1973,6 +1973,20 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
>>  	copy_user_highpage(dst, src, va, vma);
>>  }
>>
>> +static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
>> +{
>> +	struct file *vm_file = vma->vm_file;
>> +
>> +	if (vm_file)
>> +		return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
>> +
>> +	/*
>> +	 * Special mappings (e.g. VDSO) do not have any file so fake
>> +	 * a default GFP_KERNEL for them.
>> +	 */
>> +	return GFP_KERNEL;
>> +}
>> +
>>  /*
>>   * Notify the address space that the page is about to become writable so that
>>   * it can prohibit this or wait for the page to get into an appropriate state.
>> @@ -1988,6 +2002,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>>  	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
>>  	vmf.pgoff = page->index;
>>  	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>  	vmf.page = page;
>>
>>  	ret = vma->vm_ops->page_mkwrite(vma, &vmf);
>> @@ -2670,6 +2685,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>>  	vmf.pgoff = pgoff;
>>  	vmf.flags = flags;
>>  	vmf.page = NULL;
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>
>>  	ret = vma->vm_ops->fault(vma, &vmf);
>>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>> @@ -2834,6 +2850,7 @@ static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
>>  	vmf.pgoff = pgoff;
>>  	vmf.max_pgoff = max_pgoff;
>>  	vmf.flags = flags;
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>  	vma->vm_ops->map_pages(vma, &vmf);
>>  }
>>
>> --
>> 2.15.0.2308.g658a28aa74af
>>

--
Harsh Shandilya, PRJKT Development LLC