Re: [PATCH v1] mm/hugetlb_vmemmap: remap head page to newly allocated page

On 8/4/22 08:17, Muchun Song wrote:
> On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote:
>>
>>
>> On 8/3/22 11:44, Muchun Song wrote:
>>> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
>>>> On 8/3/22 05:11, Muchun Song wrote:
>>>>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
>>>>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
>>>>>> freed back to the page allocator is as follows: for a 2M hugetlb page
>>>>>> it will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
>>>>>> pages, and for a 1G hugetlb page it will remap the remaining 4095 vmemmap
>>>>>> pages. Essentially, that means it breaks up the first 4K of a potentially
>>>>>> contiguous chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G
>>>>>> hugetlb pages). For this reason the memory that is freed back to the page
>>>>>> allocator cannot be used by hugetlb to allocate huge pages of the same
>>>>>> size, but rather only of a smaller huge page size:
>>>>>>
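
(For reference, a back-of-the-envelope sketch of the sizes quoted above, assuming 4K
base pages and a 64-byte struct page; the helper below is purely illustrative and not
part of the patch:)

#define BASE_PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL	/* assumed sizeof(struct page) */

/* vmemmap pages needed to hold the struct pages of one hugetlb page */
static unsigned long nr_vmemmap_pages(unsigned long hugepage_size)
{
	unsigned long nr_struct_pages = hugepage_size / BASE_PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* nr_vmemmap_pages(2UL << 20) == 8    -> 1 reused + 7 freed    (2M hugetlb) */
/* nr_vmemmap_pages(1UL << 30) == 4096 -> 1 reused + 4095 freed (1G hugetlb) */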
>>>>>
>>>>> Hi Joao,
>>>>>
>>>>> Thanks for your work on this. I admit you are right. The current mechanism
>>>>> prevents the freed vmemmap pages from being merged into a potentially
>>>>> contiguous page. Allocating a new head page is a straightforward approach,
>>>>> however, it is very dangerous at runtime after the system has booted. Why
>>>>> dangerous? Because you have to first 1) copy the content from the head vmemmap
>>>>> page to the target (newly allocated) page, and then 2) change the PTE
>>>>> entry to the new page. However, the content (especially the refcount) of the
>>>>> old head vmemmap page could be changed elsewhere (e.g. by other modules)
>>>>> between steps 1) and 2). Eventually, the newly allocated vmemmap page is
>>>>> corrupted. Unfortunately, we don't have an easy approach to prevent that.
>>>>>
>>>> OK, I see what I missed. You mean the refcount (or any other data) on the
>>>> struct pages preceding this head struct page, unrelated to the hugetlb
>>>> page being remapped. Meaning when the first struct page in the old vmemmap
>>>> page is *not* the head page we are trying to remap, right?
>>>>
>>>> See further below in your patch, but I wonder if we could actually check this
>>>> from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be
>>>> checking that this head page is the first struct page in the vmemmap page, and
>>>> that we are therefore still safe to remap it? If so, the patch could be simpler,
>>>> more like mine, without the special freeing path you added below.
>>>>
>>>> If I'm right, see at the end.
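
(A minimal sketch of the alignment check I had in mind above; purely illustrative,
assuming @head is the virtual address of the hugetlb head struct page:)

	/*
	 * If the head struct page is PAGE_SIZE-aligned it is the first
	 * struct page in its vmemmap page, so no unrelated struct pages
	 * precede it in the page we are about to remap.
	 */
	if (!IS_ALIGNED((unsigned long)head, PAGE_SIZE))
		return;		/* fall back to reusing the head vmemmap page */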
>>>
>>> I am not sure we are on the same page (it seems we are not, judging by your
>>> patch below).
>>
>> Even though I misunderstood you, it might still be a possible scenario.
>>
>>> So let me clarify.
>>>
>> Thanks
>>
>>> CPU0:                                            CPU1:
>>>
>>> vmemmap_remap_free(start, end, reuse)
>>>   // alloc a new page to be used as the head vmemmap page
>>>   page = alloc_pages_node();
>>>
>>>   memcpy(page_address(page), reuse, PAGE_SIZE);
>>>                                                  // Now the @reuse address is mapped to the original
>>>                                                  // page frame. So the change will be reflected on the
>>>                                                  // original page frame.
>>>                                                  get_page(reuse);
>>>   vmemmap_remap_pte();
>>>    // remap to the newly allocated page above
>>>    set_pte_at();
>>>
>>>   flush_tlb_kernel_range();
>>
>> note-to-self: I totally missed changing the flush_tlb_kernel_range() call to cover the full range.
>>
> 
> Right. I have noticed that as well.
> 
>>>                                                  // Now the @reuse address is mapped to the new allocated
>>>                                                  // page frame. So the change will be reflected on the
>>>                                                  // new page frame and it is corrupted.
>>>                                                  put_page(reuse);
>>>
>>> So we would have to make 1) the memcpy, 2) the remap and 3) the TLB flush atomic
>>> on CPU0; however, that is not easy.
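
(Restating the window above in sequential form, a purely illustrative sketch, not the
actual patch code:)

	/* CPU0: vmemmap_remap_free() */
	page = alloc_pages_node(nid, gfp_mask, 0);
	memcpy(page_address(page), reuse, PAGE_SIZE);	/* snapshot the old head frame */
	/*
	 * Window: CPU1's get_page() bumps the refcount in the _old_ page
	 * frame, after the snapshot was taken, so the bump is not present
	 * in the copy.
	 */
	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
	flush_tlb_kernel_range(start, end);
	/*
	 * CPU1's later put_page() hits the _new_ page frame, whose
	 * refcount never saw the earlier get_page(): the copied struct
	 * pages are now corrupted. Hence memcpy + remap + flush would
	 * have to appear atomic to any such concurrent user.
	 */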
>>>
>> OK, I understand what you mean now. However, I am trying to work out whether this race
>> is actually possible. Note that, given your previous answer above, I am assuming in your
>> race scenario that the vmemmap page only ever stores metadata (struct page) related to
>> the hugetlb page currently being remapped. If this assumption is wrong, then your race
>> would be possible (but it wouldn't be from a get_page() on the reuse_addr).
>>
>> So, how would we get into doing a get_page() on the head page that we are remapping (and
>> its put_page() for that matter) from somewhere else ... considering we are at
>> prep_new_huge_page() when we call vmemmap_remap_free() and hence ... we already got the
>> page from the page allocator ... but hugetlb hasn't yet finished initializing the head
>> page metadata. Hence it isn't yet accounted anywhere for someone else to grab it, e.g.
>> via demote/page-fault/migration/etc.?
>>
> 
> As far as I know, there are at least two places which could grab the refcount:
> 1) GUP and 2) memory failure.
> 
So I am aware of the means to grab the page refcount. It's how we get into such a
situation in the first place that I wasn't sure about:

> For 1), you can refer to the commit 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849.

Good pointer. This one sheds some light too:

https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@xxxxxxxxxxxxxx/

I wonder how we get into a situation of doing a GUP on a user VA referring to a
just-allocated *hugetlb* page from buddy (but not yet in the hugetlb free lists), even if
only temporarily. Unless the page was still sitting in page tables that were about to be
torn down. Maybe it is specific to gigantic pages. Hmm
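
(A minimal sketch of the kind of transient, speculative reference I mean, roughly the
sort of thing commit 7118fc2906e2 has to cope with; purely illustrative, not lifted from
that commit:)

	/* lockless walker (e.g. gup-fast) finds the page via a possibly stale mapping */
	struct page *page = pte_page(pte);

	if (!get_page_unless_zero(page))	/* speculative refcount bump */
		return 0;
	/* ... revalidate the PTE; if it changed under us, back off ... */
	put_page(page);				/* transient reference dropped again */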

> For 2), I think you can refer to the function __get_unpoison_page().
> 
Ah, yes. Good point.

> Both places could grab an extra refcount on the struct page of the HugeTLB page being processed.
> 
Thanks for the pointers ;)



