On Fri, Feb 5, 2021 at 4:54 PM Oscar Salvador <osalvador@xxxxxxx> wrote:
>
> On Thu, Feb 04, 2021 at 11:50:38AM +0800, Muchun Song wrote:
> > Every HugeTLB page has more than one struct page structure. We __know__
> > that we only use the first 4 (HUGETLB_CGROUP_MIN_ORDER) struct page
> > structures to store metadata associated with each HugeTLB page.
> >
> > There are a lot of struct page structures associated with each HugeTLB
> > page. For tail pages, the value of compound_head is the same, so we can
> > reuse the first page of the tail page structures. We map the virtual
> > addresses of the remaining pages of tail page structures to the first
> > tail page struct and then free these page frames. Therefore, we need to
> > reserve two pages as vmemmap areas.
> >
> > When we allocate a HugeTLB page from the buddy allocator, we can free
> > some vmemmap pages associated with each HugeTLB page. It is more
> > appropriate to do it in prep_new_huge_page().
> >
> > The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
> > pages associated with a HugeTLB page can be freed, returns zero for
> > now, which means the feature is disabled. We will enable it once all
> > the infrastructure is there.
> >
> > Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> > ---
>
> [...]
>
> > +void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> > +{
> > +	unsigned long vmemmap_addr = (unsigned long)head;
> > +	unsigned long vmemmap_end, vmemmap_reuse;
> > +
> > +	if (!free_vmemmap_pages_per_hpage(h))
> > +		return;
> > +
> > +	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
> > +	vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
> > +	vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> > +
> > +	/*
> > +	 * Remap the vmemmap virtual address range [@vmemmap_addr, @vmemmap_end)
> > +	 * to the page which @vmemmap_reuse is mapped to, then free the vmemmap
> > +	 * pages which the range are mapped to.
>
> "then free the pages which the range [@vmemmap_addr, @vmemmap_end] is
> mapped to."
>
> I am not a native speaker, but this sounds better to me.

I am not a native speaker either, but I believe your wording is better. :-)

> > +	 */
> > +	vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
> > +}
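
Just to make the address arithmetic above concrete, here is how it works out
once the feature is enabled (free_vmemmap_pages_per_hpage() returns zero for
now), assuming x86_64 with 4K base pages and a 64-byte struct page, so the
512 struct pages of a 2MB HugeTLB page occupy 8 vmemmap pages, of which 2 are
reserved and 6 can be freed. With "head" standing for (unsigned long)head:

	vmemmap_addr  = head + 8K	/* skip the 2 reserved vmemmap pages */
	vmemmap_end   = head + 32K	/* end of the 6 freeable vmemmap pages */
	vmemmap_reuse = head + 4K	/* vmemmap page that will be reused */

vmemmap_remap_free() then remaps [head + 8K, head + 32K) to the page frame
that backs [head + 4K, head + 8K) and frees the 6 page frames that used to
back the remapped range.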
> > diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> > new file mode 100644
> > index 000000000000..6923f03534d5
> > --- /dev/null
> > +++ b/mm/hugetlb_vmemmap.h
>
> [...]
>
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index 16183d85a7d5..50c1dc00b686 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -27,8 +27,215 @@
> >  #include <linux/spinlock.h>
> >  #include <linux/vmalloc.h>
> >  #include <linux/sched.h>
> > +#include <linux/pgtable.h>
> > +#include <linux/bootmem_info.h>
> > +
> >  #include <asm/dma.h>
> >  #include <asm/pgalloc.h>
> > +#include <asm/tlbflush.h>
> > +
> > +/**
> > + * vmemmap_remap_walk - walk vmemmap page table
> > + *
> > + * @remap_pte: called for each non-empty PTE (lowest-level) entry.
>
> Well, we BUG_ON on an empty PTE, so I am not sure that pointing it out here
> is worth it. It sounds like we do nothing when it is empty.
> Maybe:
>
> "called for each lowest-level entry (PTE)"

Thanks. I will update this.

> > + * @reuse_page: the page which is reused for the tail vmemmap pages.
> > + * @reuse_addr: the virtual address of the @reuse_page page.
> > + * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
> > + */
> > +struct vmemmap_remap_walk {
> > +	void (*remap_pte)(pte_t *pte, unsigned long addr,
> > +			  struct vmemmap_remap_walk *walk);
> > +	struct page *reuse_page;
> > +	unsigned long reuse_addr;
> > +	struct list_head *vmemmap_pages;
> > +};
> > +
> > +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> > +			      unsigned long end,
> > +			      struct vmemmap_remap_walk *walk)
> > +{
> > +	pte_t *pte;
> > +
> > +	pte = pte_offset_kernel(pmd, addr);
> > +
> > +	/*
> > +	 * The reuse_page is found 'first' in table walk before we start
> > +	 * remapping (which is calling @walk->remap_pte).
> > +	 */
> > +	if (!walk->reuse_page) {
> > +		BUG_ON(pte_none(*pte) || walk->reuse_addr != addr);
>
> I would rather have them in separate lines:
> BUG_ON(pte_none(*pte));
> BUG_ON(walk->reuse_addr != addr);
>
> It helps when trying to figure out when we explode. One could dig in the
> registers, but let's make it easier to find out.

OK. Will do.

> > +
> [...]
>
> > +static void vmemmap_remap_range(unsigned long start, unsigned long end,
> > +				struct vmemmap_remap_walk *walk)
> > +{
> > +	unsigned long addr = start;
> > +	unsigned long next;
> > +	pgd_t *pgd;
> > +
> > +	VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
> > +	VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
> > +
> > +	pgd = pgd_offset_k(addr);
> > +	do {
> > +		BUG_ON(pgd_none(*pgd));
> > +
> > +		next = pgd_addr_end(addr, end);
> > +		vmemmap_p4d_range(pgd, addr, next, walk);
> > +	} while (pgd++, addr = next, addr != end);
> > +
> > +	/*
> > +	 * We do not change the mapping of the vmemmap virtual address range
> > +	 * [@start, @start + PAGE_SIZE) which belongs to the reuse range.
> > +	 * So we not need to flush the TLB.
> > +	 */
> > +	flush_tlb_kernel_range(start + PAGE_SIZE, end);
>
> I find that comment a bit confusing. I would rather describe what we are
> flushing instead of what we are not.

OK. Will update it.
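
Maybe something like the following (only a sketch, I still need to polish
the wording):

	/*
	 * The range [@start + PAGE_SIZE, @end) was just remapped to the
	 * reuse page, so flush the stale TLB entries for it. The first
	 * page, [@start, @start + PAGE_SIZE), is the reuse page itself and
	 * its mapping is left untouched.
	 */
	flush_tlb_kernel_range(start + PAGE_SIZE, end);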
> > +}
> > +
> > +/*
> > + * Free a vmemmap page. A vmemmap page can be allocated from the memblock
> > + * allocator or buddy allocator. If the PG_reserved flag is set, it means
> > + * that it was allocated from the memblock allocator, so just free it via
> > + * free_bootmem_page(). Otherwise, use __free_page().
> > + */
> > +static inline void free_vmemmap_page(struct page *page)
> > +{
> > +	if (PageReserved(page))
> > +		free_bootmem_page(page);
> > +	else
> > +		__free_page(page);
> > +}
> > +
> > +/* Free a list of the vmemmap pages */
> > +static void free_vmemmap_page_list(struct list_head *list)
> > +{
> > +	struct page *page, *next;
> > +
> > +	list_for_each_entry_safe(page, next, list, lru) {
> > +		list_del(&page->lru);
> > +		free_vmemmap_page(page);
> > +	}
> > +}
> > +
> > +static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> > +			      struct vmemmap_remap_walk *walk)
> > +{
> > +	/*
> > +	 * Remap the tail pages as read-only to catch illegal write operations
> > +	 * to the tail pages.
> > +	 */
> > +	pgprot_t pgprot = PAGE_KERNEL_RO;
> > +	pte_t entry = mk_pte(walk->reuse_page, pgprot);
> > +	struct page *page = pte_page(*pte);
> > +
> > +	list_add(&page->lru, walk->vmemmap_pages);
> > +	set_pte_at(&init_mm, addr, pte, entry);
> > +}
> > +
> > +/**
> > + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> > + *			to the page which @reuse is mapped to, then free vmemmap
> > + *			which the range are mapped to.
> > + * @start: start address of the vmemmap virtual address range that we want
> > + *	to remap.
> > + * @end: end address of the vmemmap virtual address range that we want to
> > + *	remap.
> > + * @reuse: reuse address.
> > + *
> > + * Note: This function depends on vmemmap being base page mapped. Please make
> > + * sure that the architecture disables PMD mapping of vmemmap pages when calling
> > + * this function.
>
> Well, we do not really depend on the architecture to not map the vmemmap range
> with PMDs, right? IIUC, that is driven by your boot parameter (patch #5), which
> overrides whatever the architecture can do.

Right. I will rework the comment here.

> Functional changes look good to me, so with all the above fixes, you can add:
>
> Reviewed-by: Oscar Salvador <osalvador@xxxxxxx>

Thanks a lot.

> --
> Oscar Salvador
> SUSE L3