On Wed, Aug 03, 2022 at 03:42:15PM -0700, Mike Kravetz wrote:
> On 08/03/22 18:44, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> > > On 8/3/22 05:11, Muchun Song wrote:
> > > > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> > > >> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> > > >> freed back to the page allocator is as follows: for a 2M hugetlb page it
> > > >> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> > > >> pages, and for a 1G hugetlb page it will remap the remaining 4095 vmemmap
> > > >> pages. Essentially, that means that it breaks the first 4K of a
> > > >> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> > > >> 16M (for 1G hugetlb pages). For this reason the memory that is freed
> > > >> back to the page allocator cannot be used by hugetlb to allocate huge
> > > >> pages of the same size, but only of a smaller huge page size:
> > > >>
> > > >
> > > > Hi Joao,
> > > >
> > > > Thanks for your work on this. I admit you are right. The current mechanism
> > > > prevents the freed vmemmap pages from being merged into a potential
> > > > contiguous page. Allocating a new head page is a straightforward approach;
> > > > however, it is very dangerous at runtime after the system has booted. Why
> > > > dangerous? Because you should first 1) copy the content from the head vmemmap
> > > > page to the target (newly allocated) page, and then 2) change the PTE
> > > > entry to the new page. However, the content (especially the refcount) of the
> > > > old head vmemmap page could be changed elsewhere (e.g. by other modules)
> > > > between steps 1) and 2). Eventually, the newly allocated vmemmap page is
> > > > corrupted. Unfortunately, we don't have an easy way to prevent it.
> > > >
> > > OK, I see what I missed. You mean the refcount (or any other data) on the
> > > struct pages preceding this head struct page, unrelated to the hugetlb
> > > page being remapped. Meaning when the first struct page in the old vmemmap
> > > page is *not* the head page we are trying to remap, right?
> > >
> > > See further below in your patch, but I wonder if we could actually check
> > > this from the hugetlb head va being aligned to PAGE_SIZE, meaning we would
> > > be checking that this head page is the first struct page in the vmemmap
> > > page and thus that we are still safe to remap? If so, the patch could be
> > > simpler, more like mine, without the special freeing path you added below.
> > >
> > > If I'm right, see at the end.
> > >
> > I am not sure we are on the same page (it seems we are not, after seeing
> > your patch below). So let me clarify.
>
> Thanks Muchun!
> I told Joao that you would be the expert in this area, and I was correct.

Thanks for your trust.

> >
> > CPU0:                                   CPU1:
> >
> > vmemmap_remap_free(start, end, reuse)
> >   // alloc a new page to be used as the head vmemmap page
> >   page = alloc_pages_node();
> >
> >   memcpy(page_address(page), reuse, PAGE_SIZE);
> >                                         // Now the @reuse address is mapped to the
> >                                         // original page frame, so the change will be
> >                                         // reflected on the original page frame.
> >                                         get_page(reuse);
>
> Here is a thought.
>
> This code gets called early after allocating a new hugetlb page. This new
> compound page has a ref count of 1 on the head page and 0 on all the tail
> pages. If the ref count were 0 on the head page, get_page() would not succeed.
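
(The guarantee this relies on is the speculative-reference pattern: lockless
users such as GUP-fast and compaction only take a page reference via
get_page_unless_zero(), which fails when the refcount is zero. A minimal
sketch of that pattern, purely for illustration; get_page_unless_zero() and
put_page() are the real helpers from linux/mm.h, while the function around
them is made up:)

#include <linux/mm.h>

/*
 * Illustrative only: how lockless code takes a page reference.  A page
 * whose refcount is frozen at zero can never have its refcount bumped
 * by this path, which is what makes the freeze idea below work.
 */
static bool speculative_use_page(struct page *page)
{
	/* Atomically increment the refcount unless it is zero. */
	if (!get_page_unless_zero(page))
		return false;	/* zero: page is frozen or being freed */

	/* ... read or use the page while holding the reference ... */

	put_page(page);
	return true;
}
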
>
> I cannot think of a reason why we NEED to have a ref count of 1 on the
> head page. It does make it more convenient to simply call put_page() on
> the newly initialized hugetlb page and have the page be added to the hugetlb
> free lists, but this could easily be modified.
>
> Matthew Wilcox recently pointed me at this:
> https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@xxxxxxxxxxxxx/T/#m98fb9f9bd476155b4951339da51a0887b2377476
>
> That would only work for hugetlb pages allocated from buddy. For gigantic
> pages, we manually 'freeze' (zero the ref count of) tail pages and check for
> an unexpectedly increased ref count. We could do the same with the head page.
>
> Would having a zero ref count on the head page eliminate this race?

I think most races which try to grab an extra refcount could be avoided
in this case. However, I can think of one race related to memory failure,
namely poisoning a page in memory_failure(), which sets PageHWPoison
without checking whether the refcount is zero. If we can fix this easily, I
think this patch is a good direction.

Thanks Mike.

> --
> Mike Kravetz
>
> >
> >   vmemmap_remap_pte();
> >     // remap to the new page allocated above
> >     set_pte_at();
> >
> >   flush_tlb_kernel_range();
> >                                         // Now the @reuse address is mapped to the
> >                                         // newly allocated page frame, so the change
> >                                         // will be reflected on the new page frame,
> >                                         // which is thereby corrupted.
> >                                         put_page(reuse);
> >
> > So we should make the 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0;
> > however, that is not easy.
>
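
To make the freeze idea concrete, a minimal sketch, assuming the head page's
refcount is exactly 1 as Mike describes above. page_ref_freeze() and
page_ref_unfreeze() are the existing helpers from include/linux/page_ref.h;
the function itself, its name, and its callers are hypothetical:

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/page_ref.h>
#include <linux/string.h>

/*
 * Illustrative only: freeze the hugetlb head page's refcount to zero
 * around the copy + remap, so the refcount cannot be bumped (and the
 * old vmemmap page cannot be dirtied through it) between the memcpy
 * and the PTE switch.  As Muchun notes above, writers that do not take
 * a reference first, e.g. memory_failure() setting PageHWPoison, are
 * not stopped by this and would need separate handling.
 */
static int remap_head_vmemmap_page(struct page *head, void *reuse,
				   struct page *new_page)
{
	/* Succeeds only if the refcount is exactly 1; otherwise back off. */
	if (!page_ref_freeze(head, 1))
		return -EBUSY;

	memcpy(page_address(new_page), reuse, PAGE_SIZE);

	/* ... vmemmap_remap_pte(), set_pte_at(), TLB flush go here ... */

	/* @head now resolves to the new page frame; restore the refcount. */
	page_ref_unfreeze(head, 1);
	return 0;
}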