On 08/03/22 18:44, Muchun Song wrote: > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote: > > On 8/3/22 05:11, Muchun Song wrote: > > > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote: > > >> Today with `hugetlb_free_vmemmap=on` the struct page memory that is > > >> freed back to page allocator is as following: for a 2M hugetlb page it > > >> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap > > >> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap > > >> pages. Essentially, that means that it breaks the first 4K of a > > >> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or > > >> 16M (for 1G hugetlb pages). For this reason the memory that it's free > > >> back to page allocator cannot be used for hugetlb to allocate huge pages > > >> of the same size, but rather only of a smaller huge page size: > > >> > > > > > > Hi Joao, > > > > > > Thanks for your work on this. I admit you are right. The current mechanism > > > prevented the freed vmemmap pages from being mergerd into a potential > > > contiguous page. Allocating a new head page is straightforward approach, > > > however, it is very dangerous at runtime after system booting up. Why > > > dangerous? Because you should first 1) copy the content from the head vmemmap > > > page to the targeted (newly allocated) page, and then 2) change the PTE > > > entry to the new page. However, the content (especially the refcount) of the > > > old head vmemmmap page could be changed elsewhere (e.g. other modules) > > > between the step 1) and 2). Eventually, the new allocated vmemmap page is > > > corrupted. Unluckily, we don't have an easy approach to prevent it. > > > > > OK, I see what I missed. You mean the refcount (or any other data) on the > > preceeding struct pages to this head struct page unrelated to the hugetlb > > page being remapped. Meaning when the first struct page in the old vmemmap > > page is *not* the head page we are trying to remap right? > > > > See further below in your patch but I wonder if we could actually check this > > from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be checking > > that this head page is the first of struct page in the vmemmap page that > > we are still safe to remap? If so, the patch could be simpler more > > like mine, without the special freeing path you added below. > > > > If I'm right, see at the end. > > > I am not sure we are on the same page (it seems that we are not after I saw your > below patch). So let me make it become more clarified. Thanks Muchun! I told Joao that you would be the expert in this area and was correct. > > CPU0: CPU1: > > vmemmap_remap_free(start, end, reuse) > // alloc a new page used to be the head vmemmap page > page = alloc_pages_node(); > > memcpy(page_address(page), reuse, PAGE_SIZE); > // Now the @reuse address is mapped to the original > // page frame. So the change will be reflected on the > // original page frame. > get_page(reuse); Here is a thought. This code gets called early after allocating a new hugetlb page. This new compound page has a ref count of 1 on the head page and 0 on all the tail pages. If the ref count was 0 on the head page, get_page() would not succeed. I can not think of a reason why we NEED to have a ref count of 1 on the head page. It does make it more convenient to simply call put_page() on the newly initialized hugetlb page and have the page be added to the huegtlb free lists, but this could easily be modified. Matthew Willcox recently pointed me at this: https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@xxxxxxxxxxxxx/T/#m98fb9f9bd476155b4951339da51a0887b2377476 That would only work for hugetlb pages allocated from buddy. For gigantic pages, we manually 'freeze' (zero ref count) of tail pages and check for an unexpected increased ref count. We could do the same with the head page. Would having zero ref count on the head page eliminate this race? -- Mike Kravetz > vmemmap_remap_pte(); > // remap to the above new allocated page > set_pte_at(); > > flush_tlb_kernel_range(); > // Now the @reuse address is mapped to the new allocated > // page frame. So the change will be reflected on the > // new page frame and it is corrupted. > put_page(reuse); > > So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0, however > it is not easy.