On 5/6/21 2:18 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>>
>> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
>> means that pages are mapped at a given huge page alignment and utilize
>> uses compound pages as opposed to order-0 pages.
>>
>> To minimize struct page overhead we take advantage of the fact that
>
> I'm not sure if Andrew is a member of the "we" police, but I am on
> team "imperative tense".
>

That was my mistake once again. You did mention it over the RFC. A few days
after submitting the series I noticed[0] that this patch and the next one
still had a few instances of misplaced usage of 'we', in addition to a little
unit mistake I made in the cover letter.

[0] https://lore.kernel.org/linux-mm/a2b7dc3a-3f1d-b8e1-4770-e261055f2b17@xxxxxxxxxx/

> "Take advantage of the fact that most tail pages look the same (except
> the first two) to minimize struct page overhead."
>
> ...I think all the other "we"s below can be dropped without losing any meaning.
>

Indeed.

>> most tail pages look the same (except the first two). We allocate a
>> separate page for the vmemmap area which contains the head page and
>> separate for the next 64 pages. The rest of the subsections then reuse
>> this tail vmemmap page to initialize the rest of the tail pages.
>>
>> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
>> when initializing compound pagemap with big enough @align (e.g. 1G
>> PUD) it will cross various sections. To be able to reuse tail pages
>> across sections belonging to the same gigantic page we fetch the
>> @range being mapped (nr_ranges + 1). If the section being mapped is
>> not offset 0 of the @align, then lookup the PFN of the struct page
>> address that preceeds it and use that to populate the entire
>
> s/preceeds/precedes/
>

/me nods

>> section.
>>
>> On compound pagemaps with 2M align, this lets mechanism saves 6 pages
>
> s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/
>

Will fix.

>> out of the 8 necessary PFNs necessary to set the subsection's 512
>> struct pages being mapped. On a 1G compound pagemap it saves
>> 4094 pages.
>>
>> Altmap isn't supported yet, given various restrictions in altmap pfn
>> allocator, thus fallback to the already in use vmemmap_populate().
>
> Perhaps also note that altmap for devmap mappings was there to relieve
> the pressure of inordinate amounts of memmap space to map terabytes of
> PMEM. With compound pages the motivation for altmaps for PMEM is
> reduced.
>

OK, I suppose it's worth highlighting. But how 'reduced' this motivation
really is still isn't 100% clear to me. As we discussed over the RFC, with the
DEVMAP_PMD geometry we still take a non-trivial hit of ~4G per TB. And altmap
is a rather quick way of having heavily disproportionate RAM-to-PMEM ratios
working without paying the RAM penalty, especially with the pinning numbers so
heavily reduced.
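
To put concrete numbers on it (back-of-the-envelope, assuming x86-64 with 4K
base pages and a 64-byte struct page):

        2M compound page: 512 struct pages = 32K of memmap = 8 vmemmap pages;
                          2 used with reuse (head + tail) -> 6 pages saved
        1G compound page: 262144 struct pages = 16M of memmap = 4096 vmemmap
                          pages; 2 used with reuse -> 4094 pages saved
        1T of pmem at 2M: 16G of memmap without reuse vs ~4G with reuse, i.e.
                          the ~4G per TB hit mentioned above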

>>
>> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
>> ---
>>  include/linux/mm.h  |   2 +-
>>  mm/memremap.c       |   1 +
>>  mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
>>  3 files changed, 131 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 61474602c2b1..49d717ae40ae 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> -                            struct vmem_altmap *altmap);
>> +                            struct vmem_altmap *altmap, void *block);
>>  void *vmemmap_alloc_block(unsigned long size, int node);
>>  struct vmem_altmap;
>>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index d160853670c4..2e6bc0b1ff00 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>>  {
>>          struct mhp_params params = {
>>                  .altmap = pgmap_altmap(pgmap),
>> +                .pgmap = pgmap,
>>                  .pgprot = PAGE_KERNEL,
>>          };
>>          const int nr_range = pgmap->nr_range;
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 8814876edcfa..f57c5eada099 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>>  }
>>
>>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> -                                       struct vmem_altmap *altmap)
>> +                                       struct vmem_altmap *altmap, void *block)
>
> For type-safety I would make this argument a 'struct page *' and drop
> the virt_to_page().
>

Will do.

>>  {
>>          pte_t *pte = pte_offset_kernel(pmd, addr);
>>          if (pte_none(*pte)) {
>>                  pte_t entry;
>> -                void *p;
>> -
>> -                p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> -                if (!p)
>> -                        return NULL;
>> +                void *p = block;
>> +
>> +                if (!block) {
>> +                        p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> +                        if (!p)
>> +                                return NULL;
>> +                } else if (!altmap) {
>> +                        get_page(virt_to_page(block));
>
> This is either super subtle, or there is a missing put_page(). I'm
> assuming that in the shutdown path the same page will be freed
> multiple times because the same page is mapped multiple times.
>

It's the former (subtlety):

1) When a PTE/PMD entry is freed from the init_mm there's a free_pages() call
to this page allocated above. free_pages() at the top does a
put_page_testzero() prior to freeing the actual page.

2) Given this codepath is solely used by pgmap infra, the page allocator is
already available (i.e. slab_is_available()), hence we will always go down the
alloc_pages_node() codepath.

> Have you validated that the accounting is correct and all allocated
> pages get freed? I feel this needs more than a comment, it needs a
> test to validate that the accounting continues to happen correctly as
> future vmemmap changes land.
>

I can add a comment *at least*. I checked the accounting, but let me be extra
pedantic and recheck this once again, as I settled on this part some time ago.
I will follow up on this chunk should it turn out to be broken in some way.
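
As a first stab at that comment, perhaps something along these lines (just a
sketch, still against the current void *block signature):

        } else if (!altmap) {
                /*
                 * Reused tail vmemmap page: every additional PTE mapping it
                 * takes a reference here, and every PTE teardown drops one
                 * via free_pages() -> put_page_testzero(), so the page is
                 * only returned to the page allocator by the last user.
                 */
                get_page(virt_to_page(block));
        }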

>> +                }
>>                  entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>>                  set_pte_at(&init_mm, addr, pte, entry);
>>          }
>> @@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>  }
>>
>>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> -                                              struct vmem_altmap *altmap)
>> +                                              struct vmem_altmap *altmap,
>> +                                              void *page, void **ptr)
>>  {
>>          pgd_t *pgd;
>>          p4d_t *p4d;
>> @@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>>          pmd = vmemmap_pmd_populate(pud, addr, node);
>>          if (!pmd)
>>                  return -ENOMEM;
>> -        pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>> +        pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>>          if (!pte)
>>                  return -ENOMEM;
>>          vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>> +
>> +        if (ptr)
>> +                *ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
>> +        return 0;
>>  }
>>
>>  int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>> @@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>          unsigned long addr = start;
>>
>>          for (; addr < end; addr += PAGE_SIZE) {
>> -                if (vmemmap_populate_address(addr, node, altmap))
>> +                if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
>> +                        return -ENOMEM;
>> +        }
>> +
>> +        return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_range(unsigned long start,
>> +                                            unsigned long end,
>> +                                            int node, void *page)
>> +{
>> +        unsigned long addr = start;
>> +
>> +        for (; addr < end; addr += PAGE_SIZE) {
>> +                if (vmemmap_populate_address(addr, node, NULL, page, NULL))
>> +                        return -ENOMEM;
>> +        }
>> +
>> +        return 0;
>> +}
>> +
>> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>> +                                                  void **ptr)
>> +{
>> +        return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
>> +}
>> +
>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>
> I think this can be replaced with a call to follow_pte(&init_mm...).
>

Ah, of course! That would shorten things up too.

>> +{
>> +        pgd_t *pgd;
>> +        p4d_t *p4d;
>> +        pud_t *pud;
>> +        pmd_t *pmd;
>> +        pte_t *pte;
>> +
>> +        pgd = pgd_offset_k(addr);
>> +        if (pgd_none(*pgd))
>> +                return NULL;
>> +
>> +        p4d = p4d_offset(pgd, addr);
>> +        if (p4d_none(*p4d))
>> +                return NULL;
>> +
>> +        pud = pud_offset(p4d, addr);
>> +        if (pud_none(*pud))
>> +                return NULL;
>> +
>> +        pmd = pmd_offset(pud, addr);
>> +        if (pmd_none(*pmd))
>> +                return NULL;
>> +
>> +        pte = pte_offset_kernel(pmd, addr);
>> +        if (pte_none(*pte))
>> +                return NULL;
>> +
>> +        return pte;
>> +}
>> +
>> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> +                                                      unsigned long start,
>> +                                                      unsigned long end, int node,
>> +                                                      struct dev_pagemap *pgmap)
>> +{
>> +        unsigned long offset, size, addr;
>> +
>> +        /*
>> +         * For compound pages bigger than section size (e.g. 1G) fill the rest
>
> Oh, I had to read this a couple times because I thought the "e.g." was
> referring to section size. How about:
>
> s/(e.g. 1G)/(e.g. x86 1G compound pages with 2M subsection size)/
>

Will fix, but instead I'll make it:

        e.g. x86 1G compound pages with 128M section size

because vmemmap_populate_compound_pages() is called per section, while being
subsection-aligned.
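
To spell out the 1G case with rough numbers (assuming x86-64, 4K base pages,
64-byte struct page and 128M sections):

        - a 1G @align page spans 8 sections, i.e. 8 calls into
          vmemmap_populate_compound_pages();
        - the section covering offset 0 of the 1G page allocates the head
          vmemmap page plus one tail vmemmap page;
        - the remaining 7 sections take the !IS_ALIGNED(offset, ...) branch
          below, look up the PTE at (start - PAGE_SIZE) and fill the whole
          section with that same tail vmemmap page.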

>> +         * of sections as tail pages.
>> +         *
>> +         * Note that memremap_pages() resets @nr_range value and will increment
>> +         * it after each range successful onlining. Thus the value or @nr_range
>> +         * at section memmap populate corresponds to the in-progress range
>> +         * being onlined that we care about.
>> +         */
>> +        offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>> +        if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
>> +            pgmap_align(pgmap) > SUBSECTION_SIZE) {
>> +                pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
>> +
>> +                if (!ptep)
>> +                        return -ENOMEM;
>
> Side note: I had been resistant to adding 'struct page' for altmap
> pages, but that would be a requirement to enable compound pages +
> altmap.
>

Fair enough. I was thinking about something a little simpler, like adding a
refcount to the altmap pfn allocator, to avoid arch changes and be a little
more tidy on space (16 bytes rather than 64 bytes). But I suspect I can avoid
the said arch changes even with a backing struct page.

>> +
>
> Perhaps a comment here to the effect "recall the page that was
> allocated in a prior iteration and fill it in with tail pages".
>

Will add.

>> +                return vmemmap_populate_range(start, end, node,
>> +                                              page_to_virt(pte_page(*ptep)));
>
> If the @block parameter changes to a 'struct page *' it also saves
> some typing here.
>

Indeed.

>> +        }
>> +
>> +        size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
>> +        for (addr = start; addr < end; addr += size) {
>> +                unsigned long next = addr, last = addr + size;
>> +                void *block;
>> +
>> +                /* Populate the head page vmemmap page */
>> +                if (vmemmap_populate_page(addr, node, NULL))
>> +                        return -ENOMEM;
>> +
>> +                /* Populate the tail pages vmemmap page */
>> +                block = NULL;
>> +                next = addr + PAGE_SIZE;
>> +                if (vmemmap_populate_page(next, node, &block))
>> +                        return -ENOMEM;
>> +
>> +                /* Reuse the previous page for the rest of tail pages */
>> +                next += PAGE_SIZE;
>> +                if (vmemmap_populate_range(next, last, node, block))
>>                          return -ENOMEM;
>>          }
>>
>> @@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>>  {
>>          unsigned long start = (unsigned long) pfn_to_page(pfn);
>>          unsigned long end = start + nr_pages * sizeof(struct page);
>> +        unsigned int align = pgmap_align(pgmap);
>> +        int r;
>>
>>          if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>>                           !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>                  return NULL;
>>
>> -        if (vmemmap_populate(start, end, nid, altmap))
>> +        if (align > PAGE_SIZE && !altmap)
>
> I also think this code will read better the devmap_geometry enum.
>

True. I am starting to like the ring of it with @geometry rather than @align
to represent how metadata is used. It should avoid confusion between
device-dax @align and pagemap @align.

>> +                r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>> +        else
>> +                r = vmemmap_populate(start, end, nid, altmap);
>> +
>> +        if (r < 0)
>>                  return NULL;
>>
>>          return pfn_to_page(pfn);
>> --
>> 2.17.1
>>
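
(Going back to the devmap_geometry point above: as a strawman, with purely
provisional names, I was picturing something like

        enum devmap_geometry {
                DEVMAP_PTE,     /* order-0 pages, i.e. today's behaviour */
                DEVMAP_PMD,     /* 2M compound pages */
                DEVMAP_PUD,     /* 1G compound pages */
        };

and having __populate_section_memmap() pick the populate path based on the
geometry rather than comparing @align against PAGE_SIZE.)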