On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote:
>
>
> On 8/3/22 11:44, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> >> On 8/3/22 05:11, Muchun Song wrote:
> >>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> >>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> >>>> freed back to the page allocator is as follows: for a 2M hugetlb page it
> >>>> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> >>>> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap
> >>>> pages. Essentially, that means that it breaks the first 4K of a
> >>>> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> >>>> 16M (for 1G hugetlb pages). For this reason the memory that is freed
> >>>> back to the page allocator cannot be used for hugetlb to allocate huge
> >>>> pages of the same size, but rather only of a smaller huge page size:
> >>>>
> >>>
> >>> Hi Joao,
> >>>
> >>> Thanks for your work on this. I admit you are right. The current mechanism
> >>> prevents the freed vmemmap pages from being merged into a potentially
> >>> contiguous page. Allocating a new head page is a straightforward approach,
> >>> however, it is very dangerous at runtime after the system has booted. Why
> >>> dangerous? Because you should first 1) copy the content from the head vmemmap
> >>> page to the targeted (newly allocated) page, and then 2) change the PTE
> >>> entry to the new page. However, the content (especially the refcount) of the
> >>> old head vmemmap page could be changed elsewhere (e.g. by other modules)
> >>> between steps 1) and 2). Eventually, the newly allocated vmemmap page is
> >>> corrupted. Unfortunately, we don't have an easy approach to prevent it.
> >>>
> >> OK, I see what I missed. You mean the refcount (or any other data) on the
> >> struct pages preceding this head struct page, unrelated to the hugetlb
> >> page being remapped. Meaning when the first struct page in the old vmemmap
> >> page is *not* the head page we are trying to remap, right?
> >>
> >> See further below in your patch, but I wonder if we could actually check this
> >> from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be checking
> >> that this head page is the first struct page in the vmemmap page, and thus that
> >> we are still safe to remap? If so, the patch could be simpler, more
> >> like mine, without the special freeing path you added below.
> >>
> >> If I'm right, see at the end.
>
> > I am not sure we are on the same page (it seems that we are not after I saw your
> > patch below).
>
> Even though I misunderstood you, it might still look like a possible scenario.
>
> > So let me make it clearer.
>
> Thanks
>
> > CPU0:                                    CPU1:
> >
> > vmemmap_remap_free(start, end, reuse)
> >   // alloc a new page to be used as the head vmemmap page
> >   page = alloc_pages_node();
> >
> >   memcpy(page_address(page), reuse, PAGE_SIZE);
> >                                          // Now the @reuse address is mapped to the original
> >                                          // page frame. So the change will be reflected on the
> >                                          // original page frame.
> >                                          get_page(reuse);
> >   vmemmap_remap_pte();
> >     // remap to the new page allocated above
> >     set_pte_at();
> >
> >   flush_tlb_kernel_range();
> note-to-self: totally missed to change the flush_tlb_kernel_range() to include the full range.
>
Right. I have noticed that as well.
> >                                          // Now the @reuse address is mapped to the newly allocated
> >                                          // page frame. So the change will be reflected on the
> >                                          // new page frame and it is corrupted.
> >                                          put_page(reuse);
> >
> > So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0, however
> > it is not easy.
>
> OK, I understand what you mean now. However, I am trying to follow whether this race is
> possible? Note that, given your previous answer above, I am assuming in your race scenario
> that the vmemmap page only ever stores metadata (struct page) related to the hugetlb page
> currently being remapped. If this assumption is wrong, then your race would be possible
> (but it wouldn't be from a get_page on the reuse_addr).
>
> So, how would we get into doing a get_page() on the head page that we are remapping (and
> its put_page() for that matter) from somewhere else ... considering we are at
> prep_new_huge_page() when we call vmemmap_remap_free() and hence ... we already got it
> from the page allocator ... but hugetlb hasn't yet finished initializing the head page
> metadata. Hence it isn't yet accounted for someone to grab either, e.g. in
> demote/page-fault/migration/etc?
>
As I know, there are at least two places which could take the refcount: 1) GUP and
2) memory failure. For 1), you can refer to commit 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849.
For 2), I think you can refer to the function __get_unpoison_page(). Both places could
grab an extra refcount on the struct page of the HugeTLB page being processed.

Thanks.

> On the hugetlb freeing path (vmemmap_remap_alloc) we just keep the already mapped
> head page as is and we will not be allocating a new one.
>
> Perhaps I am missing something obvious.
>
> > Thanks.
> >
> >>
> >>> I also thought of solving this issue, but I didn't find any easy way to
> >>> solve it after the system has booted. However, it is possible at system boot
> >>> time, because if the system is in a very early initialization stage,
> >>> no one should be accessing struct pages. I implemented it a month ago.
> >>> I didn't send it out since some additional preparation work is required.
> >>> The main preparation is to move the allocation of HugeTLB to a very
> >>> early initialization stage (I want to put it into pure_initcall). Because
> >>> the allocation of HugeTLB is parsed from the cmdline, whose logic is very
> >>> complex, I have sent a patch to clean it up [1]. After this cleanup, it
> >>> will be easy to move the allocation to an early stage. Sorry, I have been
> >>> busy lately and didn't have time to send a new version. But I will
> >>> send it ASAP.
> >>>
> >>> [1] https://lore.kernel.org/all/20220616071827.3480-1-songmuchun@xxxxxxxxxxxxx/
> >>>
> >>> The following diff is the main work to remap the head vmemmap page to
> >>> a newly allocated page.
> >>>
> >> I like how you use walk::reuse_addr to tell it's a head page.
> >> I didn't find my version as clean with passing a boolean to the remap_pte().
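Going back to the race window above, here is a rough sketch of the two steps on CPU0 that would need to be atomic (illustration only, not taken from either patch; cpu0_remap_head() and its arguments are made-up names), with the kind of speculative grab that GUP-fast or memory failure can do via get_page_unless_zero() landing in between:

	/*
	 * Hypothetical illustration of the window: @newpage is the freshly
	 * allocated replacement for the head vmemmap page, @addr/@reuse is
	 * the vmemmap address the head struct page lives at.
	 */
	static void cpu0_remap_head(struct page *newpage, void *reuse,
				    unsigned long addr, pte_t *ptep)
	{
		/* step 1): snapshot the current head vmemmap page */
		copy_page(page_to_virt(newpage), reuse);

		/*
		 * Window: another CPU may do get_page_unless_zero() on a
		 * struct page that lives in the old head vmemmap page here.
		 * That refcount update only hits the old page frame, not the
		 * copy made above.
		 */

		/* step 2): switch the mapping over to the new page frame */
		set_pte_at(&init_mm, addr, ptep, mk_pte(newpage, PAGE_KERNEL));
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	}

Any refcount (or flag) update that lands in that window is silently lost once the PTE points at the copy, which is the corruption being discussed.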
> >>
> >>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> >>> index 20f414c0379f..71f2d7335e6f 100644
> >>> --- a/mm/hugetlb_vmemmap.c
> >>> +++ b/mm/hugetlb_vmemmap.c
> >>> @@ -15,6 +15,7 @@
> >>>  #include <asm/pgalloc.h>
> >>>  #include <asm/tlbflush.h>
> >>>  #include "hugetlb_vmemmap.h"
> >>> +#include "internal.h"
> >>>
> >>>  /**
> >>>   * struct vmemmap_remap_walk - walk vmemmap page table
> >>> @@ -227,14 +228,37 @@ static inline void free_vmemmap_page(struct page *page)
> >>>  }
> >>>
> >>>  /* Free a list of the vmemmap pages */
> >>> -static void free_vmemmap_page_list(struct list_head *list)
> >>> +static void free_vmemmap_pages(struct list_head *list)
> >>>  {
> >>>  	struct page *page, *next;
> >>>
> >>> +	list_for_each_entry_safe(page, next, list, lru)
> >>> +		free_vmemmap_page(page);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Free a list of vmemmap pages but skip the per-cpu free lists of the
> >>> + * buddy allocator.
> >>> + */
> >>> +static void free_vmemmap_pages_nonpcp(struct list_head *list)
> >>> +{
> >>> +	struct zone *zone;
> >>> +	struct page *page, *next;
> >>> +
> >>> +	if (list_empty(list))
> >>> +		return;
> >>> +
> >>> +	zone = page_zone(list_first_entry(list, struct page, lru));
> >>> +	zone_pcp_disable(zone);
> >>>  	list_for_each_entry_safe(page, next, list, lru) {
> >>> -		list_del(&page->lru);
> >>> +		if (zone != page_zone(page)) {
> >>> +			zone_pcp_enable(zone);
> >>> +			zone = page_zone(page);
> >>> +			zone_pcp_disable(zone);
> >>> +		}
> >>>  		free_vmemmap_page(page);
> >>>  	}
> >>> +	zone_pcp_enable(zone);
> >>>  }
> >>>
> >>>  static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> >>> @@ -244,12 +268,28 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> >>>  	 * Remap the tail pages as read-only to catch illegal write operation
> >>>  	 * to the tail pages.
> >>>  	 */
> >>> -	pgprot_t pgprot = PAGE_KERNEL_RO;
> >>> +	pgprot_t pgprot = addr == walk->reuse_addr ? PAGE_KERNEL : PAGE_KERNEL_RO;
> >>>  	pte_t entry = mk_pte(walk->reuse_page, pgprot);
> >>>  	struct page *page = pte_page(*pte);
> >>>
> >>>  	list_add_tail(&page->lru, walk->vmemmap_pages);
> >>>  	set_pte_at(&init_mm, addr, pte, entry);
> >>> +
> >>> +	if (unlikely(addr == walk->reuse_addr)) {
> >>> +		void *old = page_to_virt(page);
> >>> +
> >>> +		/* Remove it from the vmemmap_pages list to avoid being freed. */
> >>> +		list_del(&walk->reuse_page->lru);
> >>> +		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> >>> +		/*
> >>> +		 * If we reach here, the system is in a very early
> >>> +		 * initialization stage where no one should be accessing
> >>> +		 * struct pages. However, if something unexpected happens,
> >>> +		 * the head struct page is the most likely to be written to
> >>> +		 * (usually ->_refcount). Use BUG_ON() to catch this
> >>> +		 * unexpected case.
> >>> +		 */
> >>> +		BUG_ON(memcmp(old, (void *)addr, sizeof(struct page)));
> >>> +	}
> >>>  }
> >>>
> >>>  /*
> >>> @@ -298,7 +338,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >>>   *		to remap.
> >>>   * @end:	end address of the vmemmap virtual address range that we want to
> >>>   *		remap.
> >>> - * @reuse:	reuse address.
> >>> + * @reuse:	reuse address. If @reuse is equal to @start, it means the page
> >>> + *		frame which the @reuse address is mapped to will be replaced with
> >>> + *		a new page frame and the previous page frame will be freed; this
> >>> + *		is to reduce memory fragmentation.
> >>>   *
> >>>   * Return: %0 on success, negative error code otherwise.
> >>>   */
> >>> @@ -319,14 +362,26 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >>>  	 * (see more details from the vmemmap_pte_range()):
> >>>  	 *
> >>>  	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> >>> -	 *   should be continuous.
> >>> +	 *   should be continuous or @start is equal to @reuse.
> >>>  	 * - The @reuse address is part of the range [@reuse, @end) that we are
> >>>  	 *   walking which is passed to vmemmap_remap_range().
> >>>  	 * - The @reuse address is the first in the complete range.
> >>>  	 *
> >>>  	 * So we need to make sure that @start and @reuse meet the above rules.
> >>>  	 */
> >>> -	BUG_ON(start - reuse != PAGE_SIZE);
> >>> +	BUG_ON(start - reuse != PAGE_SIZE && start != reuse);
> >>> +
> >>> +	if (unlikely(reuse == start)) {
> >>> +		int nid = page_to_nid((struct page *)start);
> >>> +		gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> >>> +				 __GFP_NOWARN;
> >>> +
> >>> +		walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
> >>> +		if (walk.reuse_page) {
> >>> +			copy_page(page_to_virt(walk.reuse_page), (void *)reuse);
> >>> +			list_add(&walk.reuse_page->lru, &vmemmap_pages);
> >>> +		}
> >>> +	}
> >>>
> >>>  	mmap_read_lock(&init_mm);
> >>>  	ret = vmemmap_remap_range(reuse, end, &walk);
> >>> @@ -348,7 +403,10 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >>>  	}
> >>>  	mmap_read_unlock(&init_mm);
> >>>
> >>> -	free_vmemmap_page_list(&vmemmap_pages);
> >>> +	if (unlikely(reuse == start))
> >>> +		free_vmemmap_pages_nonpcp(&vmemmap_pages);
> >>> +	else
> >>> +		free_vmemmap_pages(&vmemmap_pages);
> >>>
> >>>  	return ret;
> >>>  }
> >>> @@ -512,6 +570,22 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
> >>>  	return true;
> >>>  }
> >>>
> >>> +/*
> >>> + * Control whether the page frame that the head vmemmap page of a HugeTLB
> >>> + * page is mapped to should be replaced with a new page. The vmemmap pages
> >>> + * are usually mapped with a huge PMD mapping; the head vmemmap page frame
> >>> + * is best freed to the buddy allocator at an early stage of system booting
> >>> + * to reduce memory fragmentation.
> >>> + */
> >>> +static bool vmemmap_remap_head __ro_after_init = true;
> >>> +
> >>> +static int __init vmemmap_remap_head_init(void)
> >>> +{
> >>> +	vmemmap_remap_head = false;
> >>> +	return 0;
> >>> +}
> >>> +core_initcall(vmemmap_remap_head_init);
> >>> +
> >>>  /**
> >>>   * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> >>>   * @h:		struct hstate.
> >>> @@ -537,6 +611,19 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> >>>  	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
> >>>
> >>>  	/*
> >>> +	 * The vmemmap pages are usually mapped with a huge PMD mapping. If the
> >>> +	 * head vmemmap page is not freed to the buddy allocator, then those
> >>> +	 * freed tail vmemmap pages cannot be merged into a big order chunk.
> >>> +	 * The head vmemmap page frame can be replaced with a newly allocated
> >>> +	 * page and be freed to the buddy allocator, then those freed vmemmap
> >>> +	 * pages have the opportunity to be merged into larger contiguous pages
> >>> +	 * to reduce memory fragmentation. vmemmap_remap_free() will do this if
> >>> +	 * @vmemmap_start is equal to @vmemmap_reuse.
> >>> +	 */
> >>> +	if (unlikely(vmemmap_remap_head))
> >>> +		vmemmap_start = vmemmap_reuse;
> >>> +
> >>
> >> Maybe it doesn't need to be strictly early init vs late init.
> >>
> >> I wonder if we can still make this trick work late-stage, but only when the head
> >> struct page is aligned to PAGE_SIZE / sizeof(struct page); when it is, it means that
> >> we are safe to replace the head vmemmap page provided we would only be covering this
> >> hugetlb page's data? Unless I am missing something and the check wouldn't be enough?
> >>
> >> A quick check with this snippet added:
> >>
> >> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> >> index 2b97df8115fe..06e028734b1e 100644
> >> --- a/mm/hugetlb_vmemmap.c
> >> +++ b/mm/hugetlb_vmemmap.c
> >> @@ -331,7 +331,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >>  	};
> >>  	gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
> >>  	int nid = page_to_nid((struct page *)start);
> >> -	struct page *page;
> >> +	struct page *page = NULL;
> >>
> >>  	/*
> >>  	 * Allocate a new head vmemmap page to avoid breaking a contiguous
> >> @@ -341,7 +341,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >>  	 * more allocations of hugepages. Fallback to the currently
> >>  	 * mapped head page in case should it fail to allocate.
> >>  	 */
> >> -	page = alloc_pages_node(nid, gfp_mask, 0);
> >> +	if (IS_ALIGNED(start, PAGE_SIZE))
> >> +		page = alloc_pages_node(nid, gfp_mask, 0);
> >>  	walk.head_page = page;
> >>
> >>  	/*
> >>
> >>
> >> ... and it looks like it can still be effective, just not as much, more or less as expected?
> >>
> >> # with 2M hugepages
> >>
> >> Before:
> >>
> >> Node    0, zone   Normal, type      Movable     76     28     10      4      1      0      0      0      1      1  15568
> >>
> >> After (allocated 32400):
> >>
> >> Node    0, zone   Normal, type      Movable    135    328    219    198    155    106     72     41     23      0      0
> >>
> >> Compared to my original patch, where there weren't any pages left in order-0..order-2:
> >>
> >> Node    0, zone   Normal, type      Movable      0      1      0     70    106     91     78     48     17      0      0
> >>
> >> But still much better than without any of this:
> >>
> >> Node    0, zone   Normal, type      Movable  32174  31999  31642    104     58     24     16      4      2      0      0
> >>
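As a side note on the IS_ALIGNED(start, PAGE_SIZE) check: a back-of-the-envelope sketch of what it does and does not guarantee (hypothetical helper, not part of either patch; assumes 4K base pages and a 64-byte struct page). It only ensures that the head vmemmap page holds no struct pages of unrelated pfns; it does not address the speculative refcount grab on the head page itself mentioned earlier in the thread:

	/* Hypothetical helper, for illustration only. */
	static inline bool head_vmemmap_page_is_exclusive(unsigned long start,
							   unsigned long pages_per_hpage)
	{
		/*
		 * One base page of vmemmap holds PAGE_SIZE / sizeof(struct page)
		 * (i.e. 64) struct pages. A 2M hugetlb page has 512 struct pages
		 * and a 1G one has 262144, so if @start (the vmemmap address of
		 * the head struct page) is PAGE_SIZE aligned, every struct page
		 * in the head vmemmap page describes a pfn inside the hugetlb
		 * page being remapped.
		 */
		return IS_ALIGNED(start, PAGE_SIZE) &&
		       pages_per_hpage >= PAGE_SIZE / sizeof(struct page);
	}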