Re: [PATCH] mm: fix page leak with multiple threads mapping the same page

"Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx> · Thu, 7 Jul 2022 01:46:57 +0300

On Tue, Jul 05, 2022 at 04:00:36PM -0400, Josef Bacik wrote:
> We have an application with a lot of threads that use a shared mmap
> backed by tmpfs mounted with -o huge=within_size.  This application
> started leaking loads of huge pages when we upgraded to a recent kernel.
> 
> Using the page ref tracepoints and a BPF program written by Tejun Heo we
> were able to determine that these pages would have multiple refcounts
> from the page fault path, but when it came to unmap time we wouldn't
> drop the number of refs we had added from the faults.
> 
> I wrote a reproducer that mmap'ed a file backed by tmpfs with -o
> huge=always, and then spawned 20 threads all looping faulting random
> offsets in this map, while using madvise(MADV_DONTNEED) randomly for
> huge page aligned ranges.  This very quickly reproduced the problem.
> 
> The problem here is that we check for the case that we have multiple
> threads faulting in a range that was previously unmapped.  One thread
> maps the PMD, the other thread loses the race and then returns 0.
> However at this point we already have the page, and we are no longer
> putting this page into the processes address space, and so we leak the
> page.  We actually did the correct thing prior to f9ce0be71d1f, however
> it looks like Kirill copied what we do in the anonymous page case.  In
> the anonymous page case we don't yet have a page, so we don't have to
> drop a reference on anything.  Previously we did the correct thing for
> file based faults by returning VM_FAULT_NOPAGE so we correctly drop the
> reference on the page we faulted in.
> 
> Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable()
> case, this makes us drop the ref on the page properly, and now my
> reproducer no longer leaks the huge pages.
> 
> Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
> Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> Signed-off-by: Chris Mason <clm@xxxxxx>

Cc: stable@ ?

> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a089145cad4..f10724d7dca3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4371,7 +4371,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  
>  	/* See comment in handle_pte_fault() */
>  	if (pmd_devmap_trans_unstable(vmf->pmd))
> -		return 0;
> +		return VM_FAULT_NOPAGE;

Comment update would be nice.

Other instances of pmd_devmap_trans_unstable() return 0 in the fault path.
Explanation would be helpful.

Otherwise,

Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>

-- 
 Kirill A. Shutemov