> On Nov 7, 2022, at 23:39, Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
> 
> Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
> back to the page allocator is as follows: for a 2M hugetlb page it will
> reuse the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and
> for a 1G hugetlb page it will remap the remaining 4095 vmemmap pages.
> Essentially, that means it breaks the first 4K of a potentially contiguous
> chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb
> pages). For this reason the memory that is freed back to the page allocator
> cannot be used by hugetlb to allocate huge pages of the same size, but
> rather only of a smaller huge page size:
> 
> Trying to assign a 64G node to hugetlb (on a 128G 2-node guest, each node
> having 64G):
> 
> * Before allocation:
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> ...
> Node    0, zone   Normal, type      Movable    340    100     32     15      1      2      0      0      0      1  15558
> 
> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 31987
> 
> * After:
> 
> Node    0, zone   Normal, type      Movable  30893  32006  31515      7      0      0      0      0      0      0      0
> 
> Notice how the memory freed back is put into the 4K / 8K / 16K page pools.
> And it allocates a total of 31974 pages (63948M).
> 
> To fix this behaviour, rather than remapping onto the existing head vmemmap
> page (thus breaking the contiguous block of memory backing the struct
> pages), repopulate the head vmemmap page with a newly allocated page. It
> copies the data from the currently mapped vmemmap page, and then remaps it
> to this new page. Additionally, change the remap_pte callback to look at
> the newly added walk::head_page, which needs to be mapped as r/w compared
> to the tail page vmemmap reuse that uses r/o.
> 
> The new head page is allocated by the caller of vmemmap_remap_free() given
> that on restore it should still be using the same code path as before. Note
> that, because right now one hugepage is remapped at a time, only one free
> 4K page at a time is needed to remap the head page. Should it fail to
> allocate said new page, it reuses the one that's already mapped, just like
> before. As a result, for every 64G of contiguous hugepages it can give back
> 1G more of contiguous memory, while needing in total 128M of new 4K pages
> (for 2M hugetlb) or 256K (for 1G hugetlb).
> 
> After the changes, trying to assign a 64G node to hugetlb (on a 128G 2-node
> guest, each node with 64G):
> 
> * Before allocation:
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> ...
> Node    0, zone   Normal, type      Movable      1      1      1      0      0      1      0      0      1      1  15564
> 
> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 32394
> 
> * After:
> 
> Node    0, zone   Normal, type      Movable      0     50     97    108     96     81     70     46     18      0      0

Thanks for your work.

> 
> In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M
> out of the 32394 (64796M) allocated. So the memory freed back is indeed
> being used again by hugetlb, and there is no massive accumulation of unused
> order-0..order-2 pages.
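As a side note, the vmemmap sizes quoted above can be sanity-checked with a
minimal user-space sketch. It assumes 4K base pages and a 64-byte struct page
(as on x86-64), and the helper names are made up for illustration; it is not
part of the patch:

/* Sanity check of the cover-letter arithmetic: vmemmap cost per hugepage. */
#include <stdio.h>

#define BASE_PAGE_SIZE		(4UL * 1024)	/* 4K base pages assumed */
#define STRUCT_PAGE_SIZE	64UL		/* 64-byte struct page assumed */

static void vmemmap_cost(const char *name, unsigned long hugepage_size)
{
	unsigned long nr_base_pages = hugepage_size / BASE_PAGE_SIZE;
	unsigned long vmemmap_bytes = nr_base_pages * STRUCT_PAGE_SIZE;
	unsigned long nr_vmemmap_pages = vmemmap_bytes / BASE_PAGE_SIZE;

	printf("%s: %luK of vmemmap in %lu pages, %lu remapped and freed\n",
	       name, vmemmap_bytes / 1024, nr_vmemmap_pages,
	       nr_vmemmap_pages - 1);
}

int main(void)
{
	vmemmap_cost("2M hugetlb", 2UL * 1024 * 1024);		/* 32K, 8 pages, 7 freed */
	vmemmap_cost("1G hugetlb", 1UL * 1024 * 1024 * 1024);	/* 16M, 4096 pages, 4095 freed */
	return 0;
}

This prints 32K / 8 pages / 7 freed for 2M and 16M / 4096 pages / 4095 freed
for 1G, matching the numbers quoted in the cover letter.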
> 
> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
> ---
> Changes since v1[0]:
> * Drop the rw argument and check walk::head_page directly when there's
>   no reuse_page set (similar suggestion by Muchun Song to adjust it
>   inside the remap_pte callback)
> * Adjust the TLB flush to cover the head page vaddr too (Muchun Song)
> * Simplify the remap of the head page in vmemmap_pte_range()
> * Check that start is aligned to PAGE_SIZE in vmemmap_remap_free()
> 
> I've kept the same structure as in v1, compared to the chunk Muchun
> pasted in the v1 thread[1], and thus I am not altering the calling
> convention of vmemmap_remap_free()/vmemmap_restore_pte().
> The remapping of the head page is not exactly a page that is reused,
> compared to the r/o tail vmemmap page remapping. So it is a tiny semantic
> change, albeit with the same outcome in practice of changing the PTE and
> freeing the page, just with different permissions. It also made it simpler
> to gracefully fail in case of page allocation failure, and the logic
> simpler to follow IMHO.
> 
> Let me know otherwise if I followed the wrong line of thinking.
> 
> [0] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@xxxxxxxxxx/
> [1] https://lore.kernel.org/linux-mm/Yun1bJsnK%2F6MFc0b@FVFYT0MHHV2J/
> 
> ---
>  mm/hugetlb_vmemmap.c | 59 ++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 52 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 7898c2c75e35..4298c44578e3 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -22,6 +22,7 @@
>   *
>   * @remap_pte:		called for each lowest-level entry (PTE).
>   * @nr_walked:		the number of walked pte.
> + * @head_page:		the page which replaces the head vmemmap page.
>   * @reuse_page:		the page which is reused for the tail vmemmap pages.
>   * @reuse_addr:		the virtual address of the @reuse_page page.
>   * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
> @@ -31,6 +32,7 @@ struct vmemmap_remap_walk {
>  	void			(*remap_pte)(pte_t *pte, unsigned long addr,
>  					     struct vmemmap_remap_walk *walk);
>  	unsigned long		nr_walked;
> +	struct page		*head_page;

This field is unnecessary. We can reuse ->reuse_page to implement the same
functionality. I'll explain the reason later.

>  	struct page		*reuse_page;
>  	unsigned long		reuse_addr;
>  	struct list_head	*vmemmap_pages;
> @@ -105,10 +107,26 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
>  	 * remapping (which is calling @walk->remap_pte).
>  	 */
>  	if (!walk->reuse_page) {
> -		walk->reuse_page = pte_page(*pte);
> +		struct page *page = pte_page(*pte);
> +
> +		/*
> +		 * Copy the data from the original head, and remap to
> +		 * the newly allocated page.
> +		 */
> +		if (walk->head_page) {
> +			memcpy(page_address(walk->head_page),
> +			       page_address(page), PAGE_SIZE);
> +			walk->remap_pte(pte, addr, walk);
> +			page = walk->head_page;
> +		}
> +
> +		walk->reuse_page = page;
> +
>  		/*
> -		 * Because the reuse address is part of the range that we are
> -		 * walking, skip the reuse address range.
> +		 * Because the reuse address is part of the range that
> +		 * we are walking or the head page was remapped to a
> +		 * new page, skip the reuse address range.
>  		 */
>  		addr += PAGE_SIZE;
>  		pte++;
> @@ -204,11 +222,11 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>  	} while (pgd++, addr = next, addr != end);
> 
>  	/*
> -	 * We only change the mapping of the vmemmap virtual address range
> -	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
> +	 * We change the mapping of the vmemmap virtual address range
> +	 * [@start, end], so we only need to flush the TLB which
>  	 * belongs to the range.
>  	 */

This comment could go away. The reason I added it here is that this place is
a bit special: I wanted to tell others why we don't flush the full range from
@start to @end. Now I think it can go away.

> -	flush_tlb_kernel_range(start + PAGE_SIZE, end);
> +	flush_tlb_kernel_range(start, end);
> 
>  	return 0;
>  }
> @@ -244,9 +262,21 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>  	 * to the tail pages.
>  	 */
>  	pgprot_t pgprot = PAGE_KERNEL_RO;
> -	pte_t entry = mk_pte(walk->reuse_page, pgprot);
> +	struct page *reuse = walk->reuse_page;
>  	struct page *page = pte_page(*pte);
> +	pte_t entry;
> 
> +	/*
> +	 * When there's no walk::reuse_page, it means we allocated a new head
> +	 * page (stored in walk::head_page) and copied from the old head page.
> +	 * In that case use walk::head_page as the page to remap.
> +	 */
> +	if (!reuse) {
> +		pgprot = PAGE_KERNEL;
> +		reuse = walk->head_page;
> +	}
> +
> +	entry = mk_pte(reuse, pgprot);
>  	list_add_tail(&page->lru, walk->vmemmap_pages);
>  	set_pte_at(&init_mm, addr, pte, entry);
>  }
> @@ -315,6 +345,21 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>  		.reuse_addr	= reuse,
>  		.vmemmap_pages	= &vmemmap_pages,
>  	};
> +	gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;

It would be better to add __GFP_THISNODE here for performance, and to replace
__GFP_RETRY_MAYFAIL with __GFP_NORETRY to keep it consistent with
hugetlb_vmemmap_restore().

> +	int nid = page_to_nid((struct page *)start);
> +	struct page *page = NULL;
> +
> +	/*
> +	 * Allocate a new head vmemmap page to avoid breaking a contiguous
> +	 * block of struct page memory when freeing it back to the page
> +	 * allocator in free_vmemmap_page_list(). This will allow the likely
> +	 * contiguous struct page backing memory to be kept contiguous,
> +	 * allowing for more allocations of hugepages. Fall back to the
> +	 * currently mapped head page should it fail to allocate.
> +	 */
> +	if (IS_ALIGNED((unsigned long)start, PAGE_SIZE))

I'm curious why we need this check. IIUC, this is unnecessary.

> +		page = alloc_pages_node(nid, gfp_mask, 0);
> +	walk.head_page = page;
> 
>  	/*
>  	 * In order to make remapping routine most efficient for the huge pages,
> -- 
> 2.17.2
> 

I have implemented a version based on yours which does not introduce the
->head_page field (not tested whether it works). It seems to be simpler.
Thanks.

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c98805d5b815..8ee94f6a6697 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -202,12 +202,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 		return ret;
 	} while (pgd++, addr = next, addr != end);
 
-	/*
-	 * We only change the mapping of the vmemmap virtual address range
-	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
-	 * belongs to the range.
-	 */
-	flush_tlb_kernel_range(start + PAGE_SIZE, end);
+	flush_tlb_kernel_range(start, end);
 
 	return 0;
 }
@@ -246,6 +241,12 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	pte_t entry = mk_pte(walk->reuse_page, pgprot);
 	struct page *page = pte_page(*pte);
 
+	/* The case of remapping the head vmemmap page. */
+	if (unlikely(addr == walk->reuse_addr)) {
+		list_del(&walk->reuse_page->lru);
+		entry = mk_pte(walk->reuse_page, PAGE_KERNEL);
+	}
+
 	list_add_tail(&page->lru, walk->vmemmap_pages);
 	set_pte_at(&init_mm, addr, pte, entry);
 }
@@ -310,6 +311,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= &vmemmap_pages,
 	};
+	int nid = page_to_nid((struct page *)start);
+	gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN;
 
 	/*
 	 * In order to make remapping routine most efficient for the huge pages,
@@ -326,6 +329,20 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 	 */
 	BUG_ON(start - reuse != PAGE_SIZE);
 
+	/*
+	 * Allocate a new head vmemmap page to avoid breaking a contiguous
+	 * block of struct page memory when freeing it back to the page
+	 * allocator in free_vmemmap_page_list(). This will allow the likely
+	 * contiguous struct page backing memory to be kept contiguous,
+	 * allowing for more allocations of hugepages. Fall back to the
+	 * currently mapped head page should it fail to allocate.
+	 */
+	walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
+	if (walk.reuse_page) {
+		copy_page(page_to_virt(walk.reuse_page), (void *)walk.reuse_addr);
+		list_add(&walk.reuse_page->lru, &vmemmap_pages);
+	}
+
 	mmap_read_lock(&init_mm);
 	ret = vmemmap_remap_range(reuse, end, &walk);
 	if (ret && walk.nr_walked) {