On 8/4/22 08:17, Muchun Song wrote: > On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote: >> >> >> On 8/3/22 11:44, Muchun Song wrote: >>> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote: >>>> On 8/3/22 05:11, Muchun Song wrote: >>>>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote: >>>>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is >>>>>> freed back to page allocator is as following: for a 2M hugetlb page it >>>>>> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap >>>>>> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap >>>>>> pages. Essentially, that means that it breaks the first 4K of a >>>>>> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or >>>>>> 16M (for 1G hugetlb pages). For this reason the memory that it's free >>>>>> back to page allocator cannot be used for hugetlb to allocate huge pages >>>>>> of the same size, but rather only of a smaller huge page size: >>>>>> >>>>> >>>>> Hi Joao, >>>>> >>>>> Thanks for your work on this. I admit you are right. The current mechanism >>>>> prevented the freed vmemmap pages from being mergerd into a potential >>>>> contiguous page. Allocating a new head page is straightforward approach, >>>>> however, it is very dangerous at runtime after system booting up. Why >>>>> dangerous? Because you should first 1) copy the content from the head vmemmap >>>>> page to the targeted (newly allocated) page, and then 2) change the PTE >>>>> entry to the new page. However, the content (especially the refcount) of the >>>>> old head vmemmmap page could be changed elsewhere (e.g. other modules) >>>>> between the step 1) and 2). Eventually, the new allocated vmemmap page is >>>>> corrupted. Unluckily, we don't have an easy approach to prevent it. >>>>> >>>> OK, I see what I missed. You mean the refcount (or any other data) on the >>>> preceeding struct pages to this head struct page unrelated to the hugetlb >>>> page being remapped. Meaning when the first struct page in the old vmemmap >>>> page is *not* the head page we are trying to remap right? >>>> >>>> See further below in your patch but I wonder if we could actually check this >>>> from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be checking >>>> that this head page is the first of struct page in the vmemmap page that >>>> we are still safe to remap? If so, the patch could be simpler more >>>> like mine, without the special freeing path you added below. >>>> >>>> If I'm right, see at the end. >>> >>> I am not sure we are on the same page (it seems that we are not after I saw your >>> below patch). >> >> Even though I misunderstood you it might still look like a possible scenario. >> >>> So let me make it become more clarified. >>> >> Thanks >> >>> CPU0: CPU1: >>> >>> vmemmap_remap_free(start, end, reuse) >>> // alloc a new page used to be the head vmemmap page >>> page = alloc_pages_node(); >>> >>> memcpy(page_address(page), reuse, PAGE_SIZE); >>> // Now the @reuse address is mapped to the original >>> // page frame. So the change will be reflected on the >>> // original page frame. >>> get_page(reuse); >>> vmemmap_remap_pte(); >>> // remap to the above new allocated page >>> set_pte_at(); >>> >>> flush_tlb_kernel_range(); >> >> note-to-self: totally missed to change the flush_tlb_kernel_range() to include the full range. >> > > Right. I have noticed that as well. > >>> // Now the @reuse address is mapped to the new allocated >>> // page frame. So the change will be reflected on the >>> // new page frame and it is corrupted. >>> put_page(reuse); >>> >>> So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0, however >>> it is not easy. >>> >> OK, I understand what you mean now. However, I am trying to follow if this race is >> possible? Note that given your previous answer above, I am assuming in your race scenario >> that the vmemmap page only ever stores metadata (struct page) related to the hugetlb-page >> currently being remapped. If this assumption is wrong, then your race would be possible >> (but it wouldn't be from a get_page in the reuse_addr) >> >> So, how would we get into doing a get_page() on the head-page that we are remapping (and >> its put_page() for that matter) from somewhere else ... considering we are at >> prep_new_huge_page() when we call vmemmap_remap_free() and hence ... we already got it >> from page allocator ... but hugetlb hasn't yet finished initializing the head page >> metadata. Hence it isn't yet accounted for someone to grab either e.g. in >> demote/page-fault/migration/etc? >> > > As I know, at least there are two places which could get the refcount. > 1) GUP and 2) Memoey failure. > So I am aware the means to grab the page refcount. It's how we get into such situation that I wasn't sure how we get to in the first place: > For 1), you can refer to the commit 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849. Good pointer. This one sheds some light too: https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@xxxxxxxxxxxxxx/ I wonder we get into a situation of doing a GUP on a user VA referring a just-allocated *hugetlb* page from buddy (but not yet in hugetlb free lists) even if temporarily. Unless the page was still siting on a page tables that were about to tear down. Maybe it is specific to gigantic pages. Hmm > For 2), I think you can refer to the function of __get_unpoison_page(). > Ah, yes. Good point. > Both places could grab an extra refcount to the processing HugeTLB's struct page. > Thanks for the pointers ;)