Re: [PATCH v2] hugetlb: freeze allocated pages before creating hugetlb pages

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Tue, 20 Sep 2022 20:17:31 -0700

On 09/21/22 01:48, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Fri, Sep 16, 2022 at 02:46:38PM -0700, Mike Kravetz wrote:
> > v1 -> v2
> > - Fixed up head page in error path of __prep_compound_gigantic_page as
> >   discovered by Miaohe Lin.
> > - Updated link to Matthew's Allocate and free frozen pages series.
> > - Rebased on next-20220916
> > 
> >  mm/hugetlb.c | 102 +++++++++++++++++++--------------------------------
> >  1 file changed, 38 insertions(+), 64 deletions(-)
> 
> Hello Mike,
> 
> I accidentally found a NULL pointer dereference when testing the latest
> mm-unstable, which seems to be caused (or exposed?) by this patch
> (I confirmed that it disappeared by reverting this patch).
> 
> It's reproduced by doing like `sysctl vm.nr_hugepages=1000000` to allocate
> hugepages as much as possible.
> 
> Could you check that this patch is related to the issue?
> 
> Thanks,
> Naoya Horiguchi
> 
> ---
> [   25.634476] BUG: kernel NULL pointer dereference, address: 0000000000000034
> [   25.635980] #PF: supervisor write access in kernel mode
> [   25.637283] #PF: error_code(0x0002) - not-present page
> [   25.638365] PGD 0 P4D 0
> [   25.638906] Oops: 0002 [#1] PREEMPT SMP PTI
> [   25.639779] CPU: 4 PID: 819 Comm: sysctl Tainted: G            E    N 6.0.0-rc3-v6.0-rc1-220920-1758-1398-g2b3f5+ #12
> [   25.641928] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
> [   25.643727] RIP: 0010:alloc_buddy_huge_page.isra.0+0x8c/0x140
> [   25.645071] Code: fe ff 41 83 fc 01 0f 84 54 94 8b 00 41 bc 01 00 00 00 44 89 f7 4c 89 f9 44 89 ea 89 de e8 7c b9 fe ff 48 89 c7 b8 01 00 00 00 <f0> 0f b1 6f 34 66 90 83 f8 01 75 c5 48 85 ff 74 52 65 48 ff 05 03
> [   25.649006] RSP: 0018:ffffaa7181fffc18 EFLAGS: 00010286
> [   25.650215] RAX: 0000000000000001 RBX: 0000000000000009 RCX: 0000000000000009
> [   25.651672] RDX: ffffffffae3b6df0 RSI: ffffffffae8f7ce0 RDI: 0000000000000000
> [   25.653115] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000c01
> [   25.654579] R10: 0000000000000f90 R11: 0000000000000000 R12: 0000000000000002
> [   25.656176] R13: 0000000000000000 R14: 0000000000346cca R15: ffffffffae8f7ce0
> [   25.657637] FS:  00007f9252f2a740(0000) GS:ffff98cebbc00000(0000) knlGS:0000000000000000
> [   25.659292] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   25.660469] CR2: 0000000000000034 CR3: 000000014924c004 CR4: 0000000000170ee0
> [   25.661928] Call Trace:
> [   25.662469]  <TASK>
> [   25.662927]  alloc_fresh_huge_page+0x16f/0x1d0
> [   25.663859]  alloc_pool_huge_page+0x6d/0xb0
> [   25.664734]  __nr_hugepages_store_common+0x189/0x3e0
> [   25.665764]  ? __do_proc_doulongvec_minmax+0x31f/0x340
> [   25.666832]  hugetlb_sysctl_handler_common+0xbf/0xd0
> [   25.667861]  ? hugetlb_register_node+0xe0/0xe0
> [   25.668786]  proc_sys_call_handler+0x196/0x2b0
> [   25.669724]  vfs_write+0x29b/0x3a0
> [   25.670454]  ksys_write+0x4f/0xd0
> [   25.671153]  do_syscall_64+0x3b/0x90
> [   25.671909]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [   25.672958] RIP: 0033:0x7f9252d3e727
> [   25.673712] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> [   25.677470] RSP: 002b:00007ffcdf9904a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [   25.679002] RAX: ffffffffffffffda RBX: 000055c6ae683210 RCX: 00007f9252d3e727
> [   25.680456] RDX: 0000000000000006 RSI: 000055c6ae683250 RDI: 0000000000000003
> [   25.681910] RBP: 000055c6ae685380 R08: 0000000000000003 R09: 0000000000000077
> [   25.683373] R10: 000000000000006b R11: 0000000000000246 R12: 0000000000000006
> [   25.684824] R13: 0000000000000006 R14: 0000000000000006 R15: 00007f9252df59e0
> [   25.686293]  </TASK>

Hello Naoya,

My bad for an obvious mistake!  Note this change in the patch,

> @@ -1951,7 +1953,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
>  		gfp_mask |= __GFP_RETRY_MAYFAIL;
>  	if (nid == NUMA_NO_NODE)
>  		nid = numa_mem_id();
> +retry:
>  	page = __alloc_pages(gfp_mask, order, nid, nmask);
> +
> +	/* Freeze head page */
> +	if (!page_ref_freeze(page, 1)) {
> +		__free_pages(page, order);
> +		if (retry) {	/* retry once */
> +			retry = false;
> +			goto retry;
> +		}
> +		/* WOW!  twice in a row. */
> +		pr_warn("HugeTLB head page unexpected inflated ref count\n");
> +		page = NULL;
> +	}
> +
>  	if (page)
>  		__count_vm_event(HTLB_BUDDY_PGALLOC);
>  	else

It does not check for a successful return from __alloc_pages before trying to
freeze 'page'.  It should obviously check for page == NULL before trying to
freeze with something like this.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index db90658db171..a092dd639687 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1961,7 +1961,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	page = __alloc_pages(gfp_mask, order, nid, nmask);
 
 	/* Freeze head page */
-	if (!page_ref_freeze(page, 1)) {
+	if (page && !page_ref_freeze(page, 1)) {
 		__free_pages(page, order);
 		if (retry) {	/* retry once */
 			retry = false;

I will send out a v3 (my) tomorrow.
-- 
Mike Kravetz