Re: [PATCH v2] hugetlb: freeze allocated pages before creating hugetlb pages

On Fri, Sep 16, 2022 at 02:46:38PM -0700, Mike Kravetz wrote:
> When creating hugetlb pages, the hugetlb code must first allocate
> contiguous pages from a low level allocator such as buddy, cma or
> memblock.  The pages returned from these low level allocators are
> ref counted.  This creates potential issues with other code taking
> speculative references on these pages before they can be transformed to
> a hugetlb page.  This issue has been addressed with methods and code
> such as that provided in [1].
> 
> Recent discussions about vmemmap freeing [2] have indicated that it
> would be beneficial to freeze all sub pages, including the head page
> of pages returned from low level allocators before converting to a
> hugetlb page.  This helps avoid races if we want to replace the page
> containing vmemmap for the head page.
> 
> There have been proposals to change at least the buddy allocator to
> return frozen pages as described at [3].  If such a change is made, it
> can be employed by the hugetlb code.  However, as mentioned above
> hugetlb uses several low level allocators so each would need to be
> modified to return frozen pages.  For now, we can manually freeze the
> returned pages.  This is done in two places:
> 1) alloc_buddy_huge_page, only the returned head page is ref counted.
>    We freeze the head page, retrying once in the VERY rare case where
>    there may be an inflated ref count.
> 2) prep_compound_gigantic_page, for gigantic pages the current code
>    freezes all pages except the head page.  New code will simply freeze
>    the head page as well.
> 
> In a few other places, code checks for inflated ref counts on newly
> allocated hugetlb pages.  With the modifications to freeze after
> allocating, this code can be removed.
> 
> After hugetlb pages are freshly allocated, they are often added to the
> hugetlb free lists.  Since these pages were previously ref counted, this
> was done via put_page() which would end up calling the hugetlb
> destructor: free_huge_page.  With changes to freeze pages, we simply
> call free_huge_page directly to add the pages to the free list.
> 
> In a few other places, freshly allocated hugetlb pages were immediately
> put into use, and the expectation was they were already ref counted.  In
> these cases, we must manually ref count the page.
> 
> [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@xxxxxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@xxxxxxxxxx/
> [3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@xxxxxxxxxxxxx/
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
> v1 -> v2
> - Fixed up head page in error path of __prep_compound_gigantic_page as
>   discovered by Miaohe Lin.
> - Updated link to Matthew's Allocate and free frozen pages series.
> - Rebased on next-20220916
> 
>  mm/hugetlb.c | 102 +++++++++++++++++++--------------------------------
>  1 file changed, 38 insertions(+), 64 deletions(-)

Hello Mike,

I came across a NULL pointer dereference while testing the latest
mm-unstable. It seems to be caused (or exposed?) by this patch:
I confirmed that it disappears when this patch is reverted.

It can be reproduced by running something like
`sysctl vm.nr_hugepages=1000000`, which tries to allocate as many
hugepages as possible.

Could you check whether this patch is related to the issue?

Thanks,
Naoya Horiguchi

---
[   25.634476] BUG: kernel NULL pointer dereference, address: 0000000000000034
[   25.635980] #PF: supervisor write access in kernel mode
[   25.637283] #PF: error_code(0x0002) - not-present page
[   25.638365] PGD 0 P4D 0
[   25.638906] Oops: 0002 [#1] PREEMPT SMP PTI
[   25.639779] CPU: 4 PID: 819 Comm: sysctl Tainted: G            E    N 6.0.0-rc3-v6.0-rc1-220920-1758-1398-g2b3f5+ #12
[   25.641928] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[   25.643727] RIP: 0010:alloc_buddy_huge_page.isra.0+0x8c/0x140
[   25.645071] Code: fe ff 41 83 fc 01 0f 84 54 94 8b 00 41 bc 01 00 00 00 44 89 f7 4c 89 f9 44 89 ea 89 de e8 7c b9 fe ff 48 89 c7 b8 01 00 00 00 <f0> 0f b1 6f 34 66 90 83 f8 01 75 c5 48 85 ff 74 52 65 48 ff 05 03
[   25.649006] RSP: 0018:ffffaa7181fffc18 EFLAGS: 00010286
[   25.650215] RAX: 0000000000000001 RBX: 0000000000000009 RCX: 0000000000000009
[   25.651672] RDX: ffffffffae3b6df0 RSI: ffffffffae8f7ce0 RDI: 0000000000000000
[   25.653115] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000c01
[   25.654579] R10: 0000000000000f90 R11: 0000000000000000 R12: 0000000000000002
[   25.656176] R13: 0000000000000000 R14: 0000000000346cca R15: ffffffffae8f7ce0
[   25.657637] FS:  00007f9252f2a740(0000) GS:ffff98cebbc00000(0000) knlGS:0000000000000000
[   25.659292] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   25.660469] CR2: 0000000000000034 CR3: 000000014924c004 CR4: 0000000000170ee0
[   25.661928] Call Trace:
[   25.662469]  <TASK>
[   25.662927]  alloc_fresh_huge_page+0x16f/0x1d0
[   25.663859]  alloc_pool_huge_page+0x6d/0xb0
[   25.664734]  __nr_hugepages_store_common+0x189/0x3e0
[   25.665764]  ? __do_proc_doulongvec_minmax+0x31f/0x340
[   25.666832]  hugetlb_sysctl_handler_common+0xbf/0xd0
[   25.667861]  ? hugetlb_register_node+0xe0/0xe0
[   25.668786]  proc_sys_call_handler+0x196/0x2b0
[   25.669724]  vfs_write+0x29b/0x3a0
[   25.670454]  ksys_write+0x4f/0xd0
[   25.671153]  do_syscall_64+0x3b/0x90
[   25.671909]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   25.672958] RIP: 0033:0x7f9252d3e727
[   25.673712] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   25.677470] RSP: 002b:00007ffcdf9904a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   25.679002] RAX: ffffffffffffffda RBX: 000055c6ae683210 RCX: 00007f9252d3e727
[   25.680456] RDX: 0000000000000006 RSI: 000055c6ae683250 RDI: 0000000000000003
[   25.681910] RBP: 000055c6ae685380 R08: 0000000000000003 R09: 0000000000000077
[   25.683373] R10: 000000000000006b R11: 0000000000000246 R12: 0000000000000006
[   25.684824] R13: 0000000000000006 R14: 0000000000000006 R15: 00007f9252df59e0
[   25.686293]  </TASK>
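
A possible reading of the trace above, offered as a sketch and not a
confirmed diagnosis: the faulting bytes `f0 0f b1 6f 34` decode to
`lock cmpxchg %ebp,0x34(%rdi)`, with RAX = 1, RBP = 0, RDI = 0 and
CR2 = 0x34. That is consistent with alloc_buddy_huge_page attempting the
refcount freeze (swap 1 for 0) on a NULL page pointer, i.e. before the
allocation result was checked. The user-space model below illustrates
that ordering. `struct page_model`, `freeze_refcount` and
`freeze_if_allocated` are hypothetical names for illustration only; this
is not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the reference count is modeled. */
struct page_model {
    atomic_int refcount;
};

/*
 * Models the freeze step: atomically replace the refcount with 0, but only
 * if it currently equals `expected`. The caller must pass a valid page;
 * calling this with NULL is the failure mode suggested by the trace.
 */
bool freeze_refcount(struct page_model *page, int expected)
{
    assert(page != NULL);
    return atomic_compare_exchange_strong(&page->refcount, &expected, 0);
}

/*
 * Check the allocation result first, then freeze.
 * Returns the frozen page, or NULL if the allocation failed or the
 * refcount was inflated (in which case the count is left unchanged).
 */
struct page_model *freeze_if_allocated(struct page_model *page)
{
    if (page == NULL)
        return NULL;              /* allocation failed: nothing to freeze */
    if (!freeze_refcount(page, 1))
        return NULL;              /* someone else holds a reference */
    return page;
}
```

With the NULL check placed before the freeze, an exhausted allocator
(as when asking for 1000000 hugepages) yields a plain allocation failure
that the caller can handle, and a retry on an inflated count remains
possible.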



