[patch 05/14] mm, memcg: fix reclaim deadlock with writeback

akpm@xxxxxxxxxxxxxxxxxxxx · Tue, 08 Jan 2019 15:23:07 -0800

From: Michal Hocko <mhocko@xxxxxxxx>
Subject: mm, memcg: fix reclaim deadlock with writeback

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

task1:
[<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
[<ffffffff811c5777>] shrink_page_list+0x907/0x960
[<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
[<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
[<ffffffff811c70a8>] shrink_node+0xd8/0x300
[<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
[<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[<ffffffff8122df2d>] try_charge+0x14d/0x720
[<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
[<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
[<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
[<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
[<ffffffff81074247>] pte_alloc_one+0x17/0x40
[<ffffffff811e34de>] __pte_alloc+0x1e/0x110
[<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
[<ffffffff811e5d93>] do_fault+0x103/0x970
[<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
[<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
[<ffffffff8106ecb0>] do_page_fault+0x30/0x80
[<ffffffff8171bce8>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

task2:
[<ffffffff811aadc6>] __lock_page+0x86/0xa0
[<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
[<ffffffff811bbede>] do_writepages+0x1e/0x30
[<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
[<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
[<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
[<ffffffff81273568>] wb_writeback+0x268/0x300
[<ffffffff81273d24>] wb_workfn+0xb4/0x390
[<ffffffff810a2f19>] process_one_work+0x189/0x420
[<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
[<ffffffff810a9786>] kthread+0xe6/0x100
[<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
[<ffffffffffffffff>] 0xffffffffffffffff

He adds
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
: bit the page which tasks1 has locked.

More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg.  That in turn hits a memory
limit reclaim and the memcg reclaim for legacy controller is waiting on
the writeback but that is never going to finish because the writeback
itself is waiting for the page locked in the #PF path.  So this is
essentially ABBA deadlock:

                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
lock_page(B)
                                        lock_page(B)
pte_alloc_pne
  shrink_page_list
    wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

This accumulating of more pages to flush is used by several filesystems to
generate a more optimal IO patterns.

Waiting for the writeback in legacy memcg controller is a workaround for
pre-mature OOM killer invocations because there is no dirty IO throttling
available for the controller.  There is no easy way around that
unfortunately.  Therefore fix this specific issue by pre-allocating the
page table outside of the page lock.  We have that handy infrastructure
for that already so simply reuse the fault-around pattern which already
does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare.  I am not
aware of a better solution unfortunately.

[akpm@xxxxxxxxxxxxxxxxxxxx: fix mm/memory.c:__do_fault()]
[akpm@xxxxxxxxxxxxxxxxxxxx: coding-style fixes]
[mhocko@xxxxxxxxxx: enhance comment, per Johannes]
  Link: http://lkml.kernel.org/r/20181214084948.GA5624@xxxxxxxxxxxxxx
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@xxxxxxxxxx
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
Reported-by: Liu Bo <bo.liu@xxxxxxxxxxxxxxxxx>
Debugged-by: Liu Bo <bo.liu@xxxxxxxxxxxxxxxxx>
Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Reviewed-by: Liu Bo <bo.liu@xxxxxxxxxxxxxxxxx>
Cc: Jan Kara <jack@xxxxxxx>
Cc: Dave Chinner <david@xxxxxxxxxxxxx>
Cc: Theodore Ts'o <tytso@xxxxxxx>
Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/memory.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

--- a/mm/memory.c~mm-memcg-fix-reclaim-deadlock-with-writeback
+++ a/mm/memory.c
@@ -2994,6 +2994,28 @@ static vm_fault_t __do_fault(struct vm_f
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback:
+	 *				lock_page(A)
+	 *				SetPageWriteback(A)
+	 *				unlock_page(A)
+	 * lock_page(B)
+	 *				lock_page(B)
+	 * pte_alloc_pne
+	 *   shrink_page_list
+	 *     wait_on_page_writeback(A)
+	 *				SetPageWriteback(B)
+	 *				unlock_page(B)
+	 *				# flush A, B to clear the writeback
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
_