Hi David,

> Subject: Re: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before
> allocation
>
> On 14.01.25 09:08, Vivek Kasireddy wrote:
> > There are cases when we try to pin a folio but discover that it has
> > not been faulted-in. So, we try to allocate it in memfd_alloc_folio()
> > but there is a chance that we might encounter a crash/failure
> > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations
> > at that instant. This issue was reported by syzbot:
> >
> > kernel BUG at mm/hugetlb.c:2403!
> > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
> > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
> > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
> > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
> > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
> > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
> > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
> > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
> > FS:  00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >  <TASK>
> >  memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
> >  memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
> >  udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
> >  udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
> >  udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
> >  udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
> >  vfs_ioctl fs/ioctl.c:51 [inline]
> >  __do_sys_ioctl fs/ioctl.c:906 [inline]
> >  __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
> >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > Therefore, to avoid this situation and fix this issue, we just need
> > to make a reservation (by calling hugetlb_reserve_pages()) before
> > we try to allocate the folio. This will ensure that we are properly
> > doing region/subpool accounting associated with our allocation.
> >
> > While at it, move subpool_inode() into hugetlb header and also
> > replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to
> > crash the system in this scenario and instead we could just warn
> > and fail the allocation.
> >
> > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak")
> > Reported-by: syzbot+a504cb5bae4fe117ba94@xxxxxxxxxxxxxxxxxxxxxxxxx
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@xxxxxxxxx>
> > Cc: Steve Sistare <steven.sistare@xxxxxxxxxx>
> > Cc: Muchun Song <muchun.song@xxxxxxxxx>
> > Cc: David Hildenbrand <david@xxxxxxxxxx>
> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > ---
> >  include/linux/hugetlb.h |  5 +++++
> >  mm/hugetlb.c            | 14 ++++++--------
> >  mm/memfd.c              | 14 +++++++++++---
> >  3 files changed, 22 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index ae4fe8615bb6..38c580548564 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -712,6 +712,11 @@ extern unsigned int default_hstate_idx;
> >
> >  #define default_hstate (hstates[default_hstate_idx])
> >
> > +static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
> > +{
> > +	return HUGETLBFS_SB(inode->i_sb)->spool;
> > +}
> > +
> >  static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio)
> >  {
> >  	return folio->_hugetlb_subpool;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c498874a7170..ef948f56b864 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -251,11 +251,6 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
> >  	return ret;
> >  }
> >
> > -static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
> > -{
> > -	return HUGETLBFS_SB(inode->i_sb)->spool;
> > -}
> > -
> >  static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> >  {
> >  	return subpool_inode(file_inode(vma->vm_file));
> > @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> >  	struct folio *folio;
> >
> >  	spin_lock_irq(&hugetlb_lock);
> > +	if (WARN_ON_ONCE(!h->resv_huge_pages)) {
> > +		spin_unlock_irq(&hugetlb_lock);
> > +		return NULL;
> > +	}
> > +
> >  	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid,
> >  					       nmask);
> > -	if (folio) {
> > -		VM_BUG_ON(!h->resv_huge_pages);
> > +	if (folio)
> >  		h->resv_huge_pages--;
> > -	}
> >
> >  	spin_unlock_irq(&hugetlb_lock);
> >  	return folio;
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 35a370d75c9a..0d128c44fb78 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -70,7 +70,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> >  #ifdef CONFIG_HUGETLB_PAGE
> >  	struct folio *folio;
> >  	gfp_t gfp_mask;
> > -	int err;
> > +	int err = -ENOMEM;
> >
> >  	if (is_file_hugepages(memfd)) {
> >  		/*
> > @@ -79,12 +79,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> >  		 * alloc from. Also, the folio will be pinned for an indefinite
> >  		 * amount of time, so it is not expected to be migrated away.
> >  		 */
> > +		struct inode *inode = file_inode(memfd);
> >  		struct hstate *h = hstate_file(memfd);
> >
> >  		gfp_mask = htlb_alloc_mask(h);
> >  		gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
> >  		idx >>= huge_page_order(h);
> >
> > +		if (!hugetlb_reserve_pages(inode, idx, idx + 1, NULL, 0))
> > +			return ERR_PTR(err);
> > +
> >  		folio = alloc_hugetlb_folio_reserve(h,
> >  						    numa_node_id(),
> >  						    NULL,
> > @@ -95,12 +99,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
> >  							idx);
> >  			if (err) {
> >  				folio_put(folio);
> > -				return ERR_PTR(err);
> > +				goto err;
> >  			}
> > +
> > +			hugetlb_set_folio_subpool(folio, subpool_inode(inode));
> >  			folio_unlock(folio);
> >  			return folio;
> >  		}
> > -	return ERR_PTR(-ENOMEM);
> > +err:
> > +	hugetlb_unreserve_pages(inode, idx, idx + 1, 0);
>
> Hmmm, shouldn't we maybe only un-reserve if we were responsible for the
> reservation above?

Good catch! Yes, I agree that un-reserving only if we were responsible
for the reservation is the right thing to do in this case.

> If it's already reserved before this call, we should probably leave it
> as is?

Yeah, that makes sense. However, there is currently no way to know if a
range has already been reserved or not. One option I can think of is to
have hugetlb_reserve_pages() provide return info (nr_reserved?) about
how many pages were successfully reserved. This way we can know if we
were the ones responsible for the reservation (rough sketch at the end
of this mail).

> Or maybe we never want to un-reserve at all here?

That also seems to be an option. AFAICS, un-reserving only appears to
happen as part of truncate/hole punch/inode eviction and not if
hugetlb_fault() fails. Unless I am overlooking something, not calling
hugetlb_unreserve_pages() if allocation fails seems to be OK.

Thanks,
Vivek

> --
> Cheers,
>
> David / dhildenb
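
P.S. Here is the rough sketch I mentioned above (untested, and just a
strawman: the extra nr_reserved out-parameter on hugetlb_reserve_pages()
does not exist today; it only illustrates how memfd_alloc_folio() could
tell whether it created the reservation before undoing it on failure):

	/*
	 * Strawman: assumes hugetlb_reserve_pages() gained an out-parameter
	 * that reports how many *new* reservations this call created
	 * (0 if the range was already reserved). Not an existing API.
	 */
	long nr_reserved = 0;

	if (!hugetlb_reserve_pages(inode, idx, idx + 1, NULL, 0,
				   &nr_reserved))
		return ERR_PTR(err);

	folio = alloc_hugetlb_folio_reserve(h, numa_node_id(), NULL,
					    gfp_mask);
	if (folio) {
		err = hugetlb_add_to_page_cache(folio, memfd->f_mapping,
						idx);
		if (err) {
			folio_put(folio);
			goto err;
		}
		hugetlb_set_folio_subpool(folio, subpool_inode(inode));
		folio_unlock(folio);
		return folio;
	}
err:
	/* Only undo a reservation that this call itself created. */
	if (nr_reserved)
		hugetlb_unreserve_pages(inode, idx, idx + 1, 0);
	return ERR_PTR(err);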