On Tue, Jul 05, 2022 at 04:00:36PM -0400, Josef Bacik wrote: > We have an application with a lot of threads that use a shared mmap > backed by tmpfs mounted with -o huge=within_size. This application > started leaking loads of huge pages when we upgraded to a recent kernel. > > Using the page ref tracepoints and a BPF program written by Tejun Heo we > were able to determine that these pages would have multiple refcounts > from the page fault path, but when it came to unmap time we wouldn't > drop the number of refs we had added from the faults. > > I wrote a reproducer that mmap'ed a file backed by tmpfs with -o > huge=always, and then spawned 20 threads all looping faulting random > offsets in this map, while using madvise(MADV_DONTNEED) randomly for > huge page aligned ranges. This very quickly reproduced the problem. > > The problem here is that we check for the case that we have multiple > threads faulting in a range that was previously unmapped. One thread > maps the PMD, the other thread loses the race and then returns 0. > However at this point we already have the page, and we are no longer > putting this page into the processes address space, and so we leak the > page. We actually did the correct thing prior to f9ce0be71d1f, however > it looks like Kirill copied what we do in the anonymous page case. In > the anonymous page case we don't yet have a page, so we don't have to > drop a reference on anything. Previously we did the correct thing for > file based faults by returning VM_FAULT_NOPAGE so we correctly drop the > reference on the page we faulted in. > > Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable() > case, this makes us drop the ref on the page properly, and now my > reproducer no longer leaks the huge pages. > > Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") > Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> > Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx> > Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx> > Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx> > Signed-off-by: Chris Mason <clm@xxxxxx> Cc: stable@ ? > --- > mm/memory.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 7a089145cad4..f10724d7dca3 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4371,7 +4371,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) > > /* See comment in handle_pte_fault() */ > if (pmd_devmap_trans_unstable(vmf->pmd)) > - return 0; > + return VM_FAULT_NOPAGE; Comment update would be nice. Other instances of pmd_devmap_trans_unstable() return 0 in the fault path. Explanation would be helpful. Otherwise, Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> -- Kirill A. Shutemov