On Thu, Aug 04, 2022 at 10:23:48AM +0100, Joao Martins wrote:
> On 8/4/22 08:17, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote:
> >>
> >> On 8/3/22 11:44, Muchun Song wrote:
> >>> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> >>>> On 8/3/22 05:11, Muchun Song wrote:
> >>>>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> >>>>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> >>>>>> freed back to the page allocator is as follows: for a 2M hugetlb page
> >>>>>> it will reuse the first 4K vmemmap page to remap the remaining 7
> >>>>>> vmemmap pages, and for a 1G hugetlb page it will remap the remaining
> >>>>>> 4095 vmemmap pages. Essentially, that means it breaks the first 4K of
> >>>>>> a potentially contiguous chunk of memory of 32K (for 2M hugetlb pages)
> >>>>>> or 16M (for 1G hugetlb pages). For this reason the memory that is
> >>>>>> freed back to the page allocator cannot be used by hugetlb to allocate
> >>>>>> huge pages of the same size, but only huge pages of a smaller size:
> >>>>>>
> >>>>>
> >>>>> Hi Joao,
> >>>>>
> >>>>> Thanks for your work on this. I admit you are right. The current
> >>>>> mechanism prevents the freed vmemmap pages from being merged into a
> >>>>> larger contiguous page. Allocating a new head page is a straightforward
> >>>>> approach; however, it is very dangerous at runtime after the system has
> >>>>> booted. Why dangerous? Because you first have to 1) copy the content of
> >>>>> the head vmemmap page to the targeted (newly allocated) page, and then
> >>>>> 2) change the PTE entry to the new page. However, the content
> >>>>> (especially the refcount) of the old head vmemmap page could be changed
> >>>>> elsewhere (e.g. by other modules) between steps 1) and 2). Eventually,
> >>>>> the newly allocated vmemmap page is corrupted. Unluckily, we don't have
> >>>>> an easy approach to prevent that.
> >>>>>
> >>>> OK, I see what I missed. You mean the refcount (or any other data) on
> >>>> the struct pages preceding this head struct page, unrelated to the
> >>>> hugetlb page being remapped. Meaning when the first struct page in the
> >>>> old vmemmap page is *not* the head page we are trying to remap, right?
> >>>>
> >>>> See further below in your patch, but I wonder if we could actually check
> >>>> this from the hugetlb head va being aligned to PAGE_SIZE. Meaning that
> >>>> we would be checking that this head page is the first struct page in the
> >>>> vmemmap page, and hence that we are still safe to remap? If so, the
> >>>> patch could be simpler, more like mine, without the special freeing path
> >>>> you added below.
> >>>>
> >>>> If I'm right, see at the end.
> >>>
> >>> I am not sure we are on the same page (it seems that we are not, after
> >>> seeing your patch below).
> >>
> >> Even though I misunderstood you, it might still look like a possible
> >> scenario.
> >>
> >>> So let me clarify.
> >>>
> >> Thanks
> >>
> >>> CPU0:                                    CPU1:
> >>>
> >>> vmemmap_remap_free(start, end, reuse)
> >>>   // alloc a new page to be used as the head vmemmap page
> >>>   page = alloc_pages_node();
> >>>
> >>>   memcpy(page_address(page), reuse, PAGE_SIZE);
> >>>                                          // Now the @reuse address is mapped
> >>>                                          // to the original page frame, so the
> >>>                                          // change will be reflected on the
> >>>                                          // original page frame.
> >>>                                          get_page(reuse);
> >>>   vmemmap_remap_pte();
> >>>     // remap to the newly allocated page above
> >>>     set_pte_at();
> >>>
> >>>   flush_tlb_kernel_range();
> >>
> >> note-to-self: I totally missed changing the flush_tlb_kernel_range() call
> >> to cover the full range.
> >>
> >
> > Right. I have noticed that as well.
> >
> >>>                                          // Now the @reuse address is mapped
> >>>                                          // to the newly allocated page frame,
> >>>                                          // so the change will be reflected on
> >>>                                          // the new page frame, which is
> >>>                                          // thereby corrupted.
> >>>                                          put_page(reuse);
> >>>
> >>> So we should make the 1) memcpy, 2) remap and 3) TLB flush atomic on
> >>> CPU0; however, that is not easy.
> >>>
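To make the window concrete, here is a minimal sketch of the CPU0 side in C,
following the ordering in the diagram above (remap_head_sketch() is a made-up
name; error handling and the remapping of the tail pages are omitted, so treat
this as an illustration rather than the exact mm/hugetlb_vmemmap.c code):

  #include <linux/gfp.h>
  #include <linux/mm.h>
  #include <asm/tlbflush.h>

  static void remap_head_sketch(unsigned long reuse_addr, pte_t *ptep)
  {
          struct page *copy = alloc_pages_node(numa_node_id(), GFP_KERNEL, 0);

          /* 1) Snapshot the current head vmemmap page. */
          memcpy(page_address(copy), (void *)reuse_addr, PAGE_SIZE);

          /*
           * RACE WINDOW: until the TLB flush below completes, another CPU
           * can still write through the old mapping (e.g. a get_page() on a
           * struct page that lives in this vmemmap page), and that write is
           * not contained in the snapshot taken above.
           */

          /* 2) Point the vmemmap PTE at the copy. */
          set_pte_at(&init_mm, reuse_addr, ptep, mk_pte(copy, PAGE_KERNEL));

          /* 3) Drop stale translations to the old page frame. */
          flush_tlb_kernel_range(reuse_addr, reuse_addr + PAGE_SIZE);
  }

A get_page()/put_page() pair that straddles this window is split across two
different page frames, which is exactly the corruption in the diagram: the
increment lands in the old frame and is lost, while the decrement lands in
the copy.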
> >> OK, I understand what you mean now. However, I am trying to follow
> >> whether this race is actually possible. Note that, given your previous
> >> answer above, I am assuming in your race scenario that the vmemmap page
> >> only ever stores metadata (struct pages) related to the hugetlb page
> >> currently being remapped. If this assumption is wrong, then your race
> >> would be possible (but it wouldn't be from a get_page() on the
> >> reuse_addr).
> >>
> >> So, how would we get into doing a get_page() on the head page that we are
> >> remapping (and its put_page() for that matter) from somewhere else ...
> >> considering we are at prep_new_huge_page() when we call
> >> vmemmap_remap_free() and hence ... we already got it from the page
> >> allocator ... but hugetlb hasn't yet finished initializing the head page
> >> metadata. Hence it isn't yet accounted, for someone to grab it either,
> >> e.g. in demote/page-fault/migration/etc.?
> >>
> >
> > As far as I know, there are at least two places which could grab the
> > refcount: 1) GUP and 2) memory failure.
> >
> So I am aware of the means to grab the page refcount. It's how we get into
> such a situation in the first place that I wasn't sure about:
>
> > For 1), you can refer to the commit
> > 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849.
>
> Good pointer. This one sheds some light too:
>
> https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@xxxxxxxxxxxxxx/
>
> I wonder how we get into a situation of doing a GUP on a user VA referring
> to a just-allocated *hugetlb* page from buddy (but not yet in the hugetlb
> free lists), even if only temporarily. Unless the page was still sitting in
> page tables that were about to be torn down. Maybe it is specific to
> gigantic pages. Hmm.

It is not specific to gigantic pages. I think your above link has a good
description of how this could happen to a non-gigantic page as well. Thanks.

> > For 2), I think you can refer to the function __get_unpoison_page().
>
> Ah, yes. Good point.
>
> > Both places could grab an extra refcount on the struct page of the
> > HugeTLB page being processed.
>
> Thanks for the pointers ;)
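For completeness, both of those grabbers boil down to the same speculative
pattern, sketched below (modeled on get_page_unless_zero() users such as
GUP-fast and __get_unpoison_page(); the function name is made up and this is
not the exact kernel code):

  #include <linux/mm.h>

  static bool speculative_grab_sketch(struct page *page)
  {
          /*
           * Nothing here synchronizes against vmemmap_remap_free(): if this
           * runs inside its memcpy-to-TLB-flush window, the refcount bump
           * lands in the old head vmemmap page frame and is missing from
           * the copy ...
           */
          if (!get_page_unless_zero(page))
                  return false;

          /* ... inspect the page, e.g. to decide how to handle poison ... */

          /* ... while this decrement may already land on the copy. */
          put_page(page);
          return true;
  }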