RE: [PATCH] mm/memfd: reserve hugetlb folios before allocation

"Kasireddy, Vivek" <vivek.kasireddy@xxxxxxxxx> · Fri, 10 Jan 2025 06:17:31 +0000

Hi Steve,

> Subject: Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation
> 
> >>> There are cases when we try to pin a folio but discover that it has
> >>> not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> >>> but there is a chance that we might encounter a crash/failure
> >>> (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> >>> at that instant. This issue was reported by syzbot:
> >>>
> >>> kernel BUG at mm/hugetlb.c:2403!
> >>> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> >>> CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted
> >>> 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> >>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> >>> 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> >>> RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> >>> Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89
> >>> f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66
> >>> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> >>> RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> >>> RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> >>> RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> >>> RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> >>> R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> >>> R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> >>> FS:  00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000)
> >>> knlGS:0000000000000000
> >>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4:
> 0000000000352ef0
> >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>> Call Trace:
> >>>    <TASK>
> >>>    memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> >>>    memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> >>>    udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> >>>    udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> >>>    udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> >>>    udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> >>>    vfs_ioctl fs/ioctl.c:51 [inline]
> >>>    __do_sys_ioctl fs/ioctl.c:906 [inline]
> >>>    __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> >>>    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >>>    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> >>>    entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >>>
> >>> Therefore, to avoid this situation and fix this issue, we just need
> >>> to make a reservation before we try to allocate the folio. While at
> >>> it, also remove the VM_BUG_ON() as there is no need to crash the
> >>> system in this scenario and instead we could just fail the allocation.
> >>>
> >>> Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios
> resv_huge_pages
> >> leak")
> >>> Reported-by:
> syzbot+a504cb5bae4fe117ba94@xxxxxxxxxxxxxxxxxxxxxxxxx
> >>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@xxxxxxxxx>
> >>> Cc: Steve Sistare <steven.sistare@xxxxxxxxxx>
> >>> Cc: Muchun Song <muchun.song@xxxxxxxxx>
> >>> Cc: David Hildenbrand <david@xxxxxxxxxx>
> >>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> >>> ---
> >>>    mm/hugetlb.c | 9 ++++++---
> >>>    mm/memfd.c   | 5 +++++
> >>>    2 files changed, 11 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>> index c498874a7170..e46c461210a4 100644
> >>> --- a/mm/hugetlb.c
> >>> +++ b/mm/hugetlb.c
> >>> @@ -2397,12 +2397,15 @@ struct folio
> *alloc_hugetlb_folio_reserve(struct
> >> hstate *h, int preferred_nid,
> >>>    	struct folio *folio;
> >>>
> >>>    	spin_lock_irq(&hugetlb_lock);
> >>> +	if (!h->resv_huge_pages) {
> >>> +		spin_unlock_irq(&hugetlb_lock);
> >>> +		return NULL;
> >>> +	}
> >>
> >> This should be the entire fix, plus deleting the VM_BUG_ON.  See below.
> >>
> >>> +
> >>>    	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
> >> preferred_nid,
> >>>    					       nmask);
> >>> -	if (folio) {
> >>> -		VM_BUG_ON(!h->resv_huge_pages);
> >>> +	if (folio)
> >>>    		h->resv_huge_pages--;
> >>> -	}
> >>>
> >>>    	spin_unlock_irq(&hugetlb_lock);
> >>>    	return folio;
> >>> diff --git a/mm/memfd.c b/mm/memfd.c
> >>> index 35a370d75c9a..a3012c444285 100644
> >>> --- a/mm/memfd.c
> >>> +++ b/mm/memfd.c
> >>> @@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file
> *memfd,
> >> pgoff_t idx)
> >>>    		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> >>>    		idx >>= huge_page_order(h);
> >>>
> >>> +		if (!hugetlb_reserve_pages(file_inode(memfd),
> >>> +					   idx, idx + 1, NULL, 0))
> >>> +			return ERR_PTR(-ENOMEM);
> >>
> >> I believe it is wrong to force a reservation here.
> > Is there any particular reason why you believe a reservation here would be
> wrong?
> > AFAICS, at the moment, we are not doing any region/subpool accounting
> before
> > our folio allocation and this gets flagged in the form of elevated
> resv_huge_pages
> > value (hugetlb_acct_memory() elevates it based on the return value of
> region_del())
> > when hugetlb_unreserve_pages() eventually gets called.
> >
> >> Pages should have already been
> >> reserved at this point, eg by calls from hugetlbfs_file_mmap or
> hugetlb_file_setup.
> > hugetlb_file_setup() does not reserve any pages as it passes in
> VM_NORESERVE.
> > And, the case we are trying to address is exactly when
> hugetlbfs_file_mmap() does
> > not get called before pinning.
> 
> But you must not break the case where hugetlbfs_file_mmap was called first, which
AFAICS, this patch does not break the primary use-case where hugetlbfs_file_mmap()
gets called first.

> reserves, then memfd_alloc_folio is called, which reserves again with your fix. Does
> that work correctly, or do the counts go bad?
Yes it works correctly and ` cat /proc/meminfo|grep Huge` does not show any
unexpected counts. 

> 
> > So, when hugetlbfs_file_mmap() does eventually
> > get called, I don't see any problem if it calls hugetlb_reserve_pages() again
> for the
> > same range or overlapping ranges.
> 
> Does that work correctly, or do the counts go bad?
It works correctly as expected. And, as mentioned to David in another email,
if a range is already reserved, and when hugetlb_reserve_pages() gets called
again for the same range (in this use-case), it would mostly become a no-op
as region_chg() and region_add() return 0 (based on my light testing).

> 
> Please try those scenarios with your test program:  mmap +
> memfd_alloc_folio, and
> memfd_alloc_folio + mmap.
Here are the scenarios I tried (with the calls in the exact order as below):
1) hugetlbfs_file_mmap()
    memfd_alloc_folio()
    hugetlb_fault()

2) memfd_alloc_folio()
    hugetlbfs_file_mmap()
    hugetlb_fault()

3) hugetlbfs_file_mmap()
    hugetlb_fault()
        alloc_hugetlb_folio()

memfd_alloc_folio() does not get called in the last case as the folio is already
allocated (and is present in the page cache). While testing all these cases, I
did not notice any splats or messed-up counts. 

If it is not too much trouble, could you please try this patch with your test-cases?

Thanks,
Vivek

> 
> - Steve
> 
> >> syzcaller has forced its way down this path without calling those pre-
> requisites,
> >> doing weird stuff as it should.
> > This issue is not very hard to reproduce. If we have free_huge_pages > 0
> and
> > resv_huge_pages = 0, and then we call memfd_pin_folios() before mmap()/
> > hugetlbfs_file_mmap() we can easily encounter this issue. Furthermore, we
> > should be able to allocate a folio in this scenario (as available_huge_pages
> > 0),
> > which we would not be able to do if we don't call hugetlb_reserve_pages().
> > Note that hugetlb_reserve_pages() actually elevates resv_huge_pages in
> > this case and kind of gives a go-ahead for the allocation.
> >
> > I have used a slightly modified udmabuf selftest to reproduce this issue
> which
> > I'll send out as part of v2.
> >
> > Thanks,
> > Vivek
> >
> >>
> >> To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above.
> >>
> >> - Steve
> >>
> >>> +
> >>>    		folio = alloc_hugetlb_folio_reserve(h,
> >>>    						    numa_node_id(),
> >>>    						    NULL,
> >>> @@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file
> *memfd,
> >> pgoff_t idx)
> >>>    			folio_unlock(folio);
> >>>    			return folio;
> >>>    		}
> >>> +		hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
> >>>    		return ERR_PTR(-ENOMEM);
> >>>    	}
> >>>    #endif
> >