Re: [PATCH] mm/memfd: reserve hugetlb folios before allocation

On 1/10/2025 1:17 AM, Kasireddy, Vivek wrote:
Hi Steve,

There are cases where we try to pin a folio but discover that it has
not been faulted in. So, we try to allocate it in memfd_alloc_folio(),
but there is a chance that we encounter a crash/failure
(VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
at that instant. This issue was reported by syzbot:

kernel BUG at mm/hugetlb.c:2403!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
FS:  00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
    <TASK>
    memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
    memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
    udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
    udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
    udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
    udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:906 [inline]
    __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

Therefore, to avoid this situation and fix this issue, we just need
to make a reservation before we try to allocate the folio. While at
it, also remove the VM_BUG_ON() as there is no need to crash the
system in this scenario; instead, we can simply fail the allocation.

Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
Reported-by: syzbot+a504cb5bae4fe117ba94@xxxxxxxxxxxxxxxxxxxxxxxxx
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@xxxxxxxxx>
Cc: Steve Sistare <steven.sistare@xxxxxxxxxx>
Cc: Muchun Song <muchun.song@xxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---
    mm/hugetlb.c | 9 ++++++---
    mm/memfd.c   | 5 +++++
    2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c498874a7170..e46c461210a4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2397,12 +2397,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
    	struct folio *folio;

    	spin_lock_irq(&hugetlb_lock);
+	if (!h->resv_huge_pages) {
+		spin_unlock_irq(&hugetlb_lock);
+		return NULL;
+	}

This should be the entire fix, plus deleting the VM_BUG_ON.  See below.

+
    	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
    					       nmask);
-	if (folio) {
-		VM_BUG_ON(!h->resv_huge_pages);
+	if (folio)
    		h->resv_huge_pages--;
-	}

    	spin_unlock_irq(&hugetlb_lock);
    	return folio;
diff --git a/mm/memfd.c b/mm/memfd.c
index 35a370d75c9a..a3012c444285 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -85,6 +85,10 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
    		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
    		idx >>= huge_page_order(h);

+		if (!hugetlb_reserve_pages(file_inode(memfd),
+					   idx, idx + 1, NULL, 0))
+			return ERR_PTR(-ENOMEM);

I believe it is wrong to force a reservation here.
Is there any particular reason why you believe a reservation here would be
wrong?
AFAICS, at the moment, we are not doing any region/subpool accounting before
our folio allocation, and this gets flagged in the form of an elevated
resv_huge_pages value (hugetlb_acct_memory() elevates it based on the return
value of region_del()) when hugetlb_unreserve_pages() eventually gets called.

Pages should have already been reserved at this point, eg by calls from
hugetlbfs_file_mmap or hugetlb_file_setup.
hugetlb_file_setup() does not reserve any pages as it passes in VM_NORESERVE.
And, the case we are trying to address is exactly when hugetlbfs_file_mmap()
does not get called before pinning.

But you must not break the case where hugetlbfs_file_mmap was called first,
which reserves, then memfd_alloc_folio is called, which reserves again with
your fix. Does that work correctly, or do the counts go bad?
AFAICS, this patch does not break the primary use-case where
hugetlbfs_file_mmap() gets called first.
Yes, it works correctly, and `cat /proc/meminfo | grep Huge` does not show
any unexpected counts.


So, when hugetlbfs_file_mmap() does eventually get called, I don't see any
problem if it calls hugetlb_reserve_pages() again for the same range or
overlapping ranges.

Does that work correctly, or do the counts go bad?
It works correctly as expected. And, as mentioned to David in another email,
if a range is already reserved and hugetlb_reserve_pages() gets called again
for the same range (as in this use-case), it mostly becomes a no-op as
region_chg() and region_add() return 0 (based on my light testing).

Great, thanks.

Please try those scenarios with your test program: mmap + memfd_alloc_folio,
and memfd_alloc_folio + mmap.
Here are the scenarios I tried (with the calls made in exactly the order
shown below):
1) hugetlbfs_file_mmap()
     memfd_alloc_folio()
     hugetlb_fault()

2) memfd_alloc_folio()
     hugetlbfs_file_mmap()
     hugetlb_fault()

3) hugetlbfs_file_mmap()
     hugetlb_fault()
         alloc_hugetlb_folio()

memfd_alloc_folio() does not get called in the last case as the folio is already
allocated (and is present in the page cache). While testing all these cases, I
did not notice any splats or messed-up counts.
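
For reference, a small helper along these lines (illustrative only, not part
of the patch or the selftest) can be dropped into a test program to dump the
same hugetlb counters as `cat /proc/meminfo | grep Huge` between the steps
above, so any unexpected HugePages_Rsvd/HugePages_Free drift is easy to spot:

/*
 * Illustrative helper only: prints the hugetlb-related lines from
 * /proc/meminfo (HugePages_*, Hugepagesize, Hugetlb) with a tag so the
 * counters can be compared before and after each step.
 */
#include <stdio.h>
#include <string.h>

static void dump_hugetlb_counters(const char *tag)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return;
	printf("--- %s ---\n", tag);
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "Huge"))
			fputs(line, stdout);
	}
	fclose(f);
}

Calling it before and after each step, e.g.
dump_hugetlb_counters("after memfd_pin_folios"), makes it easy to confirm
that the reservation counts return to their original values.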

If it is not too much trouble, could you please try this patch with your test-cases?

Done, and all tests pass.  I will review your V2 patch.
Thanks for working on this bug.

- Steve

syzkaller has forced its way down this path without calling those
prerequisites, doing weird stuff as it should.
This issue is not very hard to reproduce. If we have free_huge_pages > 0 and
resv_huge_pages = 0, and we then call memfd_pin_folios() before mmap()/
hugetlbfs_file_mmap(), we can easily encounter this issue. Furthermore, we
should be able to allocate a folio in this scenario (as available_huge_pages
> 0), which we would not be able to do if we don't call
hugetlb_reserve_pages(). Note that hugetlb_reserve_pages() actually elevates
resv_huge_pages in this case and kind of gives a go-ahead for the allocation.

I have used a slightly modified udmabuf selftest to reproduce this issue,
which I'll send out as part of v2.
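
Roughly, the problematic ordering can be exercised from userspace with a
sketch like the one below (illustrative only, not the actual selftest; it
assumes a 2MB default huge page size, a preallocated huge page pool, and
access to /dev/udmabuf). The hugetlb memfd is handed to udmabuf before any
mmap()/fault, so the pin path has to allocate the folio itself:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

int main(void)
{
	const size_t huge_sz = 2UL << 20;	/* assumes 2M default huge pages */
	struct udmabuf_create create;
	int memfd, devfd, buffd;

	/* hugetlb memfd: hugetlb_file_setup() passes VM_NORESERVE, so no
	 * reservation exists at this point */
	memfd = memfd_create("repro", MFD_HUGETLB | MFD_ALLOW_SEALING);
	if (memfd < 0 || ftruncate(memfd, huge_sz) < 0)
		return 1;
	/* udmabuf requires the memfd to be sealed against shrinking */
	if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
		return 1;

	devfd = open("/dev/udmabuf", O_RDWR);
	if (devfd < 0)
		return 1;

	/* No mmap()/fault of the memfd before this point: the ioctl ends up
	 * in memfd_pin_folios() -> memfd_alloc_folio() with resv_huge_pages
	 * still at 0, which is the condition that triggered the VM_BUG_ON. */
	memset(&create, 0, sizeof(create));
	create.memfd  = memfd;
	create.offset = 0;
	create.size   = huge_sz;
	buffd = ioctl(devfd, UDMABUF_CREATE, &create);
	printf("UDMABUF_CREATE returned %d\n", buffd);

	return buffd < 0 ? 1 : 0;
}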

Thanks,
Vivek


To fix, I suggest you simply fix alloc_hugetlb_folio_reserve as above.

- Steve

+
    		folio = alloc_hugetlb_folio_reserve(h,
    						    numa_node_id(),
    						    NULL,
@@ -100,6 +104,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
    			folio_unlock(folio);
    			return folio;
    		}
+		hugetlb_unreserve_pages(file_inode(memfd), idx, idx + 1, 1);
    		return ERR_PTR(-ENOMEM);
    	}
    #endif