On Thu, Aug 04, 2022 at 10:23:48AM +0100, Joao Martins wrote:
> On 8/4/22 08:17, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote:
> >>
> >> On 8/3/22 11:44, Muchun Song wrote:
> >>> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> >>>> On 8/3/22 05:11, Muchun Song wrote:
> >>>>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> >>>>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> >>>>>> freed back to the page allocator is as follows: for a 2M hugetlb page
> >>>>>> it will reuse the first 4K vmemmap page to remap the remaining 7
> >>>>>> vmemmap pages, and for a 1G hugetlb page it will remap the remaining
> >>>>>> 4095 vmemmap pages. Essentially, that means it breaks the first 4K of
> >>>>>> a potentially contiguous chunk of memory of 32K (for 2M hugetlb pages)
> >>>>>> or 16M (for 1G hugetlb pages). For this reason the memory that is
> >>>>>> freed back to the page allocator cannot be used by hugetlb to allocate
> >>>>>> huge pages of the same size, but only huge pages of a smaller size:
> >>>>>>
> >>>>>
> >>>>> Hi Joao,
> >>>>>
> >>>>> Thanks for your work on this. I admit you are right. The current
> >>>>> mechanism prevents the freed vmemmap pages from being merged into a
> >>>>> larger contiguous page. Allocating a new head page is a straightforward
> >>>>> approach; however, it is very dangerous at runtime after the system has
> >>>>> booted. Why dangerous? Because you first have to 1) copy the content of
> >>>>> the head vmemmap page to the targeted (newly allocated) page, and then
> >>>>> 2) change the PTE entry to the new page. However, the content
> >>>>> (especially the refcount) of the old head vmemmap page could be changed
> >>>>> elsewhere (e.g. by other modules) between steps 1) and 2). Eventually,
> >>>>> the newly allocated vmemmap page is corrupted. Unluckily, we don't have
> >>>>> an easy approach to prevent that.
> >>>>>
> >>>> OK, I see what I missed. You mean the refcount (or any other data) on
> >>>> the struct pages preceding this head struct page, unrelated to the
> >>>> hugetlb page being remapped. Meaning when the first struct page in the
> >>>> old vmemmap page is *not* the head page we are trying to remap, right?
> >>>>
> >>>> See further below in your patch, but I wonder if we could actually check
> >>>> this from the hugetlb head va being aligned to PAGE_SIZE. Meaning that
> >>>> we would be checking that this head page is the first struct page in the
> >>>> vmemmap page, and hence that we are still safe to remap? If so, the
> >>>> patch could be simpler, more like mine, without the special freeing path
> >>>> you added below.
> >>>>
> >>>> If I'm right, see at the end.
> >>>
> >>> I am not sure we are on the same page (it seems that we are not, after
> >>> seeing your patch below).
> >>
> >> Even though I misunderstood you, it might still look like a possible
> >> scenario.
> >>
> >>> So let me clarify.
> >>>
> >> Thanks
> >>
> >>> CPU0:                                    CPU1:
> >>>
> >>> vmemmap_remap_free(start, end, reuse)
> >>>   // alloc a new page to be used as the head vmemmap page
> >>>   page = alloc_pages_node();
> >>>
> >>>   memcpy(page_address(page), reuse, PAGE_SIZE);
> >>>                                          // Now the @reuse address is mapped
> >>>                                          // to the original page frame, so the
> >>>                                          // change will be reflected on the
> >>>                                          // original page frame.
> >>>                                          get_page(reuse);
> >>>   vmemmap_remap_pte();
> >>>     // remap to the newly allocated page above
> >>>     set_pte_at();
> >>>
> >>>   flush_tlb_kernel_range();
> >>
> >> note-to-self: I totally missed changing the flush_tlb_kernel_range() call
> >> to cover the full range.
> >>
> >
> > Right. I have noticed that as well.
> >
> >>>                                          // Now the @reuse address is mapped
> >>>                                          // to the newly allocated page frame,
> >>>                                          // so the change will be reflected on
> >>>                                          // the new page frame, which is
> >>>                                          // thereby corrupted.
> >>>                                          put_page(reuse);
> >>>
> >>> So we should make the 1) memcpy, 2) remap and 3) TLB flush atomic on
> >>> CPU0; however, that is not easy.
> >>>
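To make the window concrete, here is a minimal sketch of the CPU0 side in C,
following the ordering in the diagram above (remap_head_sketch() is a made-up
name; error handling and the remapping of the tail pages are omitted, so treat
this as an illustration rather than the exact mm/hugetlb_vmemmap.c code):

  #include <linux/gfp.h>
  #include <linux/mm.h>
  #include <asm/tlbflush.h>

  static void remap_head_sketch(unsigned long reuse_addr, pte_t *ptep)
  {
          struct page *copy = alloc_pages_node(numa_node_id(), GFP_KERNEL, 0);

          /* 1) Snapshot the current head vmemmap page. */
          memcpy(page_address(copy), (void *)reuse_addr, PAGE_SIZE);

          /*
           * RACE WINDOW: until the TLB flush below completes, another CPU
           * can still write through the old mapping (e.g. a get_page() on a
           * struct page that lives in this vmemmap page), and that write is
           * not contained in the snapshot taken above.
           */

          /* 2) Point the vmemmap PTE at the copy. */
          set_pte_at(&init_mm, reuse_addr, ptep, mk_pte(copy, PAGE_KERNEL));

          /* 3) Drop stale translations to the old page frame. */
          flush_tlb_kernel_range(reuse_addr, reuse_addr + PAGE_SIZE);
  }

A get_page()/put_page() pair that straddles this window is split across two
different page frames, which is exactly the corruption in the diagram: the
increment lands in the old frame and is lost, while the decrement lands in
the copy.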
> >> OK, I understand what you mean now. However, I am trying to follow
> >> whether this race is actually possible. Note that, given your previous
> >> answer above, I am assuming in your race scenario that the vmemmap page
> >> only ever stores metadata (struct pages) related to the hugetlb page
> >> currently being remapped. If this assumption is wrong, then your race
> >> would be possible (but it wouldn't be from a get_page() on the
> >> reuse_addr).
> >>
> >> So, how would we get into doing a get_page() on the head page that we are
> >> remapping (and its put_page() for that matter) from somewhere else ...
> >> considering we are at prep_new_huge_page() when we call
> >> vmemmap_remap_free() and hence ... we already got it from the page
> >> allocator ... but hugetlb hasn't yet finished initializing the head page
> >> metadata. Hence it isn't yet accounted, for someone to grab it either,
> >> e.g. in demote/page-fault/migration/etc.?
> >>
> >
> > As far as I know, there are at least two places which could grab the
> > refcount: 1) GUP and 2) memory failure.
> >
> So I am aware of the means to grab the page refcount. It's how we get into
> such a situation in the first place that I wasn't sure about:
>
> > For 1), you can refer to the commit
> > 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849.
>
> Good pointer. This one sheds some light too:
>
> https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@xxxxxxxxxxxxxx/
>
> I wonder how we get into a situation of doing a GUP on a user VA referring
> to a just-allocated *hugetlb* page from buddy (but not yet in the hugetlb
> free lists), even if only temporarily. Unless the page was still sitting in
> page tables that were about to be torn down. Maybe it is specific to
> gigantic pages. Hmm.

It is not specific to gigantic pages. I think your above link has a good
description of how this could happen to a non-gigantic page as well. Thanks.

> > For 2), I think you can refer to the function __get_unpoison_page().
>
> Ah, yes. Good point.
>
> > Both places could grab an extra refcount on the struct page of the
> > HugeTLB page being processed.
>
> Thanks for the pointers ;)
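For completeness, both of those grabbers boil down to the same speculative
pattern, sketched below (modeled on get_page_unless_zero() users such as
GUP-fast and __get_unpoison_page(); the function name is made up and this is
not the exact kernel code):

  #include <linux/mm.h>

  static bool speculative_grab_sketch(struct page *page)
  {
          /*
           * Nothing here synchronizes against vmemmap_remap_free(): if this
           * runs inside its memcpy-to-TLB-flush window, the refcount bump
           * lands in the old head vmemmap page frame and is missing from
           * the copy ...
           */
          if (!get_page_unless_zero(page))
                  return false;

          /* ... inspect the page, e.g. to decide how to handle poison ... */

          /* ... while this decrement may already land on the copy. */
          put_page(page);
          return true;
  }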