Re: FAILED: patch "[PATCH] mm, memcg: fix reclaim deadlock with writeback" failed to apply to 4.4-stable tree

Michal Hocko <mhocko@xxxxxxxxxx> · Tue, 15 Jan 2019 16:34:44 +0100

I do not see a straightforward backport of this patch without pulling
in more changes. Has anybody actually hit the issue on those older
kernels? While the issue is possible in principle, I do not remember
anybody complaining.

On Mon 14-01-19 15:57:16, Greg KH wrote:
> From 63f3655f950186752236bb88a22f8252c11ce394 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@xxxxxxxx>
> Date: Tue, 8 Jan 2019 15:23:07 -0800
> Subject: [PATCH] mm, memcg: fix reclaim deadlock with writeback
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback:
> 
>   task1:
>     wait_on_page_bit+0x82/0xa0
>     shrink_page_list+0x907/0x960
>     shrink_inactive_list+0x2c7/0x680
>     shrink_node_memcg+0x404/0x830
>     shrink_node+0xd8/0x300
>     do_try_to_free_pages+0x10d/0x330
>     try_to_free_mem_cgroup_pages+0xd5/0x1b0
>     try_charge+0x14d/0x720
>     memcg_kmem_charge_memcg+0x3c/0xa0
>     memcg_kmem_charge+0x7e/0xd0
>     __alloc_pages_nodemask+0x178/0x260
>     alloc_pages_current+0x95/0x140
>     pte_alloc_one+0x17/0x40
>     __pte_alloc+0x1e/0x110
>     alloc_set_pte+0x5fe/0xc20
>     do_fault+0x103/0x970
>     handle_mm_fault+0x61e/0xd10
>     __do_page_fault+0x252/0x4d0
>     do_page_fault+0x30/0x80
>     page_fault+0x28/0x30
> 
>   task2:
>     __lock_page+0x86/0xa0
>     mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
>     ext4_writepages+0x479/0xd60
>     do_writepages+0x1e/0x30
>     __writeback_single_inode+0x45/0x320
>     writeback_sb_inodes+0x272/0x600
>     __writeback_inodes_wb+0x92/0xc0
>     wb_writeback+0x268/0x300
>     wb_workfn+0xb4/0x390
>     process_one_work+0x189/0x420
>     worker_thread+0x4e/0x4b0
>     kthread+0xe6/0x100
>     ret_from_fork+0x41/0x50
> 
> He adds
>  "task1 is waiting for the PageWriteback bit of the page that task2 has
>   collected in mpd->io_submit->io_bio, and task2 is waiting for the
>   LOCKED bit of the page which task1 has locked"
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg.  That in turn hits a
> memory limit reclaim and the memcg reclaim for legacy controller is
> waiting on the writeback but that is never going to finish because the
> writeback itself is waiting for the page locked in the #PF path.  So
> this is essentially ABBA deadlock:
> 
>                                         lock_page(A)
>                                         SetPageWriteback(A)
>                                         unlock_page(A)
>   lock_page(B)
>                                         lock_page(B)
>   pte_alloc_one
>     shrink_page_list
>       wait_on_page_writeback(A)
>                                         SetPageWriteback(B)
>                                         unlock_page(B)
> 
>                                         # flush A, B to clear the writeback
> 
> Accumulating more pages before flushing is used by several filesystems
> to generate more optimal IO patterns.
> 
> Waiting for the writeback in legacy memcg controller is a workaround for
> premature OOM killer invocations because there is no dirty IO
> throttling available for the controller.  There is no easy way around
> that unfortunately.  Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock.  We already have handy
> infrastructure for that, so simply reuse the fault-around pattern,
> which already does this.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> made while a fs page is locked, but they should be really rare.  I am
> not aware of a better solution, unfortunately.
> 
> [akpm@xxxxxxxxxxxxxxxxxxxx: fix mm/memory.c:__do_fault()]
> [akpm@xxxxxxxxxxxxxxxxxxxx: coding-style fixes]
> [mhocko@xxxxxxxxxx: enhance comment, per Johannes]
> Link: http://lkml.kernel.org/r/20181214084948.GA5624@xxxxxxxxxxxxxx
> Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@xxxxxxxxxx
> Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
> Reported-by: Liu Bo <bo.liu@xxxxxxxxxxxxxxxxx>
> Debugged-by: Liu Bo <bo.liu@xxxxxxxxxxxxxxxxx>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Reviewed-by: Liu Bo <bo.liu@xxxxxxxxxxxxxxxxx>
> Cc: Jan Kara <jack@xxxxxxx>
> Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> Cc: Theodore Ts'o <tytso@xxxxxxx>
> Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index a52663c0612d..5e46836714dc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2994,6 +2994,28 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret;
>  
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback:
> +	 *				lock_page(A)
> +	 *				SetPageWriteback(A)
> +	 *				unlock_page(A)
> +	 * lock_page(B)
> +	 *				lock_page(B)
> +	 * pte_alloc_pne
> +	 *   shrink_page_list
> +	 *     wait_on_page_writeback(A)
> +	 *				SetPageWriteback(B)
> +	 *				unlock_page(B)
> +	 *				# flush A, B to clear the writeback
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}
> +
>  	ret = vma->vm_ops->fault(vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>  			    VM_FAULT_DONE_COW)))
> 
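
For anybody evaluating a backport, the ordering change in the quoted
patch can be illustrated with a small user-space sketch. This is only
an illustration under stated assumptions, not kernel code: toy_page,
toy_fault, toy_alloc_table() and the two toy_fault_*() helpers are
made-up names, and calloc() merely stands in for the charged page table
allocation that may enter memcg reclaim.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; none of these exist in the kernel. */
struct toy_page {
	bool locked;		/* models the page lock held by the fault path */
};

struct toy_fault {
	void *prealloc_pte;	/* plays the role of vmf->prealloc_pte */
	void *installed_pte;	/* table that ends up being used */
};

/* Counts allocations performed while a page lock is held. */
static int allocs_under_lock;

/*
 * Stand-in for a charged allocation. In the kernel this is the point
 * where reclaim could wait on writeback; here we only record whether
 * the caller held a locked page at that moment.
 */
static void *toy_alloc_table(const struct toy_page *held)
{
	if (held && held->locked)
		allocs_under_lock++;
	return calloc(1, 64);
}

/* Old ordering: the table is allocated with the page already locked. */
static int toy_fault_old(struct toy_fault *tf, struct toy_page *page)
{
	page->locked = true;
	tf->installed_pte = toy_alloc_table(page);
	page->locked = false;
	return tf->installed_pte ? 0 : -1;
}

/* New ordering: allocate first, then lock and consume the spare table. */
static int toy_fault_new(struct toy_fault *tf, struct toy_page *page)
{
	if (!tf->prealloc_pte) {
		tf->prealloc_pte = toy_alloc_table(NULL);
		if (!tf->prealloc_pte)
			return -1;	/* analogous to VM_FAULT_OOM */
	}

	page->locked = true;
	tf->installed_pte = tf->prealloc_pte;	/* no allocation here */
	tf->prealloc_pte = NULL;
	page->locked = false;
	return 0;
}
```

With the old ordering the counter goes up on every call, which models
the window where reclaim can wait on writeback with a page locked. With
the new ordering the counter is left untouched because the only
allocation happens before the lock is taken.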

-- 
Michal Hocko
SUSE Labs