On Fri 07-12-18 10:51:04, Liu Bo wrote:
> On Fri, Dec 07, 2018 at 12:20:36PM +0100, Michal Hocko wrote:
> > On Fri 07-12-18 08:16:15, Michal Hocko wrote:
> > [...]
> > > Memcg v1 indeed doesn't have any dirty IO throttling and this is a
> > > poor man's workaround. We still do not have that AFAIK and I do not know
> > > of an elegant way around that. Fortunately we shouldn't have that many
> > > GFP_KERNEL | __GFP_ACCOUNT allocations under page lock and we can work
> > > around this specific one quite easily. I haven't tested this yet but the
> > > following should work
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 4ad2d293ddc2..59c98eeb0260 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2993,6 +2993,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> > >  	struct vm_area_struct *vma = vmf->vma;
> > >  	vm_fault_t ret;
> > >
> > > +	/*
> > > +	 * Preallocate pte before we take page_lock because this might lead to
> > > +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> > > +	 */
> > > +	if (!vmf->prealloc_pte) {
> > > +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> > > +		if (!vmf->prealloc_pte)
> > > +			return VM_FAULT_OOM;
> > > +	}
> > > +
> > >  	ret = vma->vm_ops->fault(vmf);
> > >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> > >  			    VM_FAULT_DONE_COW)))
> >
> > This is too eager to allocate pte even when it is not really needed.
> > Jack has also pointed out that I am missing a write barrier. So here we
> > go with an updated patch. This is essentially what the fault-around code
> > does.
> >
>
> Makes sense to me, unfortunately we don't have a local reproducer to verify it
> and we've disabled CONFIG_MEMCG_KMEM to work around the problem. Given the stack
> I posted, the patch should address the deadlock at least.

OK, I will send a full patch with the changelog tomorrow.
-- 
Michal Hocko
SUSE Labs