On 7/28/21 7:55 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>>
>> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
>
> Maybe s/compound pagemap/compound devmap/ per the other planned usage
> of "devmap" in the implementation?
>
Yeap. I am replacing pagemap with devmap -- hopefully better done than the
s/align/geometry rename, which still has some leftovers in this series.

>> means that pages are mapped at a given huge page alignment and utilize
>> compound pages as opposed to order-0 pages.
>>
>> Take advantage of the fact that most tail pages look the same (except
>> the first two) to minimize struct page overhead. Allocate a separate
>> page for the vmemmap area which contains the head page and separate for
>> the next 64 pages. The rest of the subsections then reuse this tail
>> vmemmap page to initialize the rest of the tail pages.
>>
>> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
>> when initializing compound pagemap with big enough @align (e.g. 1G
>
> s/@align/@geometry/?
>
Yeap (and the previous mention too in the hunk before this one).

>> PUD) it will cross various sections.
>
> s/will cross various/may cross multiple/
>
OK

>> To be able to reuse tail pages
>> across sections belonging to the same gigantic page, fetch the
>> @range being mapped (nr_ranges + 1). If the section being mapped is
>> not offset 0 of the @align, then lookup the PFN of the struct page
>> address that precedes it and use that to populate the entire
>> section.
>
> This sounds like code being read aloud. I would just say something like:
>
> "The vmemmap code needs to consult @pgmap so that multiple sections
> that all map the same tail data can refer back to the first copy of
> that data for a given gigantic page."
>
Fixed.

>>
>> On compound pagemaps with 2M align, this mechanism lets 6 pages be
>> saved out of the 8 PFNs necessary to set the subsection's
>> 512 struct pages being mapped. On a 1G compound pagemap it saves
>> 4094 pages.
>>
>> Altmap isn't supported yet, given various restrictions in altmap pfn
>> allocator, thus we fall back to the already in use vmemmap_populate(). It
>> is worth noting that altmap for devmap mappings was there to relieve the
>> pressure of inordinate amounts of memmap space to map terabytes of pmem.
>> With compound pages the motivation for altmaps for pmem gets reduced.
>
> Looks good just some minor comments / typo fixes, and some requests
> for a few more helper functions.
>
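As an illustrative aside (not part of the patch), the page counts quoted in the
commit message above (6 of the 8 vmemmap pages saved per 2M compound page,
4094 of 4096 per 1G) follow directly from the sizes involved. A minimal
userspace sketch of that arithmetic, assuming 4K base pages and a 64-byte
struct page:

#include <stdio.h>

/*
 * Illustrative only (not kernel code): reproduce the vmemmap page counts
 * above, assuming 4K base pages and a 64-byte struct page.
 */
#define BASE_PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL

static void report(const char *name, unsigned long compound_size)
{
	unsigned long nr_struct_pages = compound_size / BASE_PAGE_SIZE;
	unsigned long vmemmap_pages = nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
	/* One page keeps the head (plus first tails), one holds the reused tails. */
	unsigned long kept = 2;

	printf("%s: %lu vmemmap pages, %lu kept, %lu saved\n",
	       name, vmemmap_pages, kept, vmemmap_pages - kept);
}

int main(void)
{
	report("2M compound page", 2UL << 20);	/* 8 pages, 2 kept, 6 saved */
	report("1G compound page", 1UL << 30);	/* 4096 pages, 2 kept, 4094 saved */
	return 0;
}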
>>
>> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
>> ---
>>  Documentation/vm/vmemmap_dedup.rst |  27 +++++-
>>  include/linux/mm.h                 |   2 +-
>>  mm/memremap.c                      |   1 +
>>  mm/sparse-vmemmap.c                | 133 +++++++++++++++++++++++++++--
>>  4 files changed, 151 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
>> index 215ae2ef3bce..42830a667c2a 100644
>> --- a/Documentation/vm/vmemmap_dedup.rst
>> +++ b/Documentation/vm/vmemmap_dedup.rst
>> @@ -2,9 +2,12 @@
>>
>> .. _vmemmap_dedup:
>>
>> -==================================
>> -Free some vmemmap pages of HugeTLB
>> -==================================
>> +=================================================
>> +Free some vmemmap pages of HugeTLB and Device DAX
>
> How about "A vmemmap diet for HugeTLB and Device DAX"
>
> ...because in the HugeTLB case it is dynamically remapping and freeing
> the pages after the fact, while Device-DAX is avoiding the allocation
> in the first instance.
>
Yeap. Better title indeed, fixed.

[...]

>> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> +						      unsigned long start,
>> +						      unsigned long end, int node,
>> +						      struct dev_pagemap *pgmap)
>> +{
>> +	unsigned long offset, size, addr;
>> +
>> +	/*
>> +	 * For compound pages bigger than section size (e.g. x86 1G compound
>> +	 * pages with 2M subsection size) fill the rest of sections as tail
>> +	 * pages.
>> +	 *
>> +	 * Note that memremap_pages() resets @nr_range value and will increment
>> +	 * it after each range successful onlining. Thus the value or @nr_range
>> +	 * at section memmap populate corresponds to the in-progress range
>> +	 * being onlined here.
>> +	 */
>> +	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>> +	if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
>> +	    pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
>
> How about moving the last 3 lines plus the comment to a helper so this
> becomes something like:
>
> if (compound_section_index(start_pfn, pgmap))
>
> ...where it is clear that for the Nth section in a compound page where
> N is > 0, it can lookup the page data to reuse.
>
Definitely more readable. Here's what I have so far (already with the change
of pgmap_geometry() to be nr of pages):

+/*
+ * For compound pages bigger than section size (e.g. x86 1G compound
+ * pages with 2M subsection size) fill the rest of sections as tail
+ * pages.
+ *
+ * Note that memremap_pages() resets @nr_range value and will increment
+ * it after each range successful onlining. Thus the value or @nr_range
+ * at section memmap populate corresponds to the in-progress range
+ * being onlined here.
+ */
+static bool compound_section_index(unsigned long start_pfn,
+				   struct dev_pagemap *pgmap)
+{
+	unsigned long geometry_size = pgmap_geometry(pgmap) << PAGE_SHIFT;
+	unsigned long offset = PFN_PHYS(start_pfn) -
+		pgmap->ranges[pgmap->nr_range].start;
+
+	return !IS_ALIGNED(offset, geometry_size) &&
+		geometry_size > SUBSECTION_SIZE;
+}
+
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 						     unsigned long start,
 						     unsigned long end, int node,
 						     struct dev_pagemap *pgmap)
 {
-	unsigned long geometry_size = pgmap_geometry(pgmap) << PAGE_SHIFT;
 	unsigned long offset, size, addr;

-	/*
-	 * For compound pages bigger than section size (e.g. x86 1G compound
-	 * pages with 2M subsection size) fill the rest of sections as tail
-	 * pages.
-	 *
-	 * Note that memremap_pages() resets @nr_range value and will increment
-	 * it after each range successful onlining. Thus the value or @nr_range
-	 * at section memmap populate corresponds to the in-progress range
-	 * being onlined here.
-	 */
-	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
-	if (!IS_ALIGNED(offset, geometry_size) &&
-	    geometry_size > SUBSECTION_SIZE) {
+	if (compound_section_index(start_pfn, pgmap)) {
 		pte_t *ptep;

 		addr = start - PAGE_SIZE;

>
>> +		pte_t *ptep;
>> +
>> +		addr = start - PAGE_SIZE;
>> +
>> +		/*
>> +		 * Sections are populated sequently and in sucession meaning
>> +		 * this section being populated wouldn't start if the
>> +		 * preceding one wasn't successful. So there is a guarantee that
>> +		 * the previous struct pages are mapped when trying to lookup
>> +		 * the last tail page.
>
> I think you can cut this down to:
>
> "Assuming sections are populated sequentially, the previous section's
> page data can be reused."
>
OK.

> ...and maybe this can be a helper like:
>
> compound_section_tail_page()?
>
It makes this patch more readable. Albeit doing this means we might need a
compound_section_tail_huge_page() (...)

>
>> + * the last tail page.
>
>> +		ptep = pte_offset_kernel(pmd_off_k(addr), addr);
>> +		if (!ptep)
>> +			return -ENOMEM;
>> +
>> +		/*
>> +		 * Reuse the page that was populated in the prior iteration
>> +		 * with just tail struct pages.
>> +		 */
>> +		return vmemmap_populate_range(start, end, node,
>> +					      pte_page(*ptep));
>> +	}

The last patch separates the above check and uses the PMD (and the @offset)
to reuse the PMD of the compound_section_tail_page(). So this might mean that
we introduce in the last patch some sort of compound_section_tail_huge_page()
for the pmd page. So far the second change doesn't seem to translate into an
obvious improvement in readability.

Pasted below. Here's compound_section_tail_page():

[...]

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d7419b5d54d7..31f94802c095 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -673,6 +673,23 @@ static bool __meminit compound_section_index(unsigned long start_pfn,
 		geometry_size > SUBSECTION_SIZE;
 }

+static struct page * __meminit compound_section_tail_page(unsigned long addr)
+{
+	pte_t *ptep;
+
+	addr -= PAGE_SIZE;
+
+	/*
+	 * Assuming sections are populated sequentially, the previous section's
+	 * page data can be reused.
+	 */
+	ptep = pte_offset_kernel(pmd_off_k(addr), addr);
+	if (!ptep)
+		return NULL;
+
+	return pte_page(*ptep);
+}
+
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 						     unsigned long start,
 						     unsigned long end, int node,
@@ -681,27 +698,17 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	unsigned long offset, size, addr;

 	if (compound_section_index(start_pfn, pgmap)) {
-		pte_t *ptep;
-
-		addr = start - PAGE_SIZE;
+		struct page *page;

-		/*
-		 * Sections are populated sequently and in sucession meaning
-		 * this section being populated wouldn't start if the
-		 * preceding one wasn't successful. So there is a guarantee that
-		 * the previous struct pages are mapped when trying to lookup
-		 * the last tail page.
-		 */
-		ptep = pte_offset_kernel(pmd_off_k(addr), addr);
-		if (!ptep)
+		page = compound_section_tail_page(start);
+		if (!page)
 			return -ENOMEM;

 		/*
 		 * Reuse the page that was populated in the prior iteration
 		 * with just tail struct pages.
 		 */
-		return vmemmap_populate_range(start, end, node,
-					      pte_page(*ptep));
+		return vmemmap_populate_range(start, end, node, page);
 	}

 	size = min(end - start, pgmap_geometry(pgmap) * sizeof(struct page));

[...]
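As a standalone illustration of the check that compound_section_index() performs
(not kernel code; assuming x86_64 defaults of 128M sections and 2M subsections,
and a 1G geometry), only sections that start at a geometry-aligned offset
populate fresh head + tail vmemmap pages; every other section takes the reuse
path via compound_section_tail_page():

#include <stdbool.h>
#include <stdio.h>

/*
 * Standalone model (not kernel code) of the compound_section_index() check:
 * a section reuses the previously populated tail vmemmap page unless its
 * offset within the range is aligned to the geometry size.
 */
#define SZ_2M			(2UL << 20)
#define SZ_128M			(128UL << 20)
#define SZ_1G			(1UL << 30)
#define SUBSECTION_SIZE		SZ_2M	/* 2M subsections, as on x86_64 */
#define SECTION_SIZE		SZ_128M	/* 128M sections, as on x86_64 */

static bool reuses_tail_page(unsigned long offset, unsigned long geometry_size)
{
	return (offset % geometry_size) != 0 && geometry_size > SUBSECTION_SIZE;
}

int main(void)
{
	unsigned long offset;

	/* Walk the sections of a range covering two 1G compound pages. */
	for (offset = 0; offset < 2 * SZ_1G; offset += SECTION_SIZE)
		printf("section at offset %4luM: %s\n", offset >> 20,
		       reuses_tail_page(offset, SZ_1G) ?
		       "reuse previous tail page" :
		       "populate head + tail vmemmap pages");
	return 0;
}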
And here's compound_section_tail_huge_page() (for the last patch in the series):

@@ -690,6 +727,33 @@ static struct page * __meminit compound_section_tail_page(unsigned long addr)
 	return pte_page(*ptep);
 }

+static struct page * __meminit compound_section_tail_huge_page(unsigned long addr,
+		unsigned long offset, struct dev_pagemap *pgmap)
+{
+	unsigned long geometry_size = pgmap_geometry(pgmap) << PAGE_SHIFT;
+	pmd_t *pmdp;
+
+	addr -= PAGE_SIZE;
+
+	/*
+	 * Assuming sections are populated sequentially, the previous section's
+	 * page data can be reused.
+	 */
+	pmdp = pmd_off_k(addr);
+	if (!pmdp)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * Reuse the tail pages vmemmap pmd page
+	 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
+	 */
+	if (offset % geometry_size > PFN_PHYS(PAGES_PER_SECTION))
+		return pmd_page(*pmdp);
+
+	/* No reusable PMD, fallback to PTE tail page */
+	return NULL;
+}
+
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 						     unsigned long start,
 						     unsigned long end, int node,
@@ -697,14 +761,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 {
 	unsigned long offset, size, addr;

-	if (compound_section_index(start_pfn, pgmap)) {
-		struct page *page;
+	if (compound_section_index(start_pfn, pgmap, &offset)) {
+		struct page *page, *hpage;
+
+		hpage = compound_section_tail_huge_page(start, offset, pgmap);
+		if (IS_ERR(hpage))
+			return -ENOMEM;
+		else if (hpage)
+			return vmemmap_populate_pmd_range(start, end, node,
+							  hpage);

 		page = compound_section_tail_page(start);
 		if (!page)
 			return -ENOMEM;

 		/*
+		 * Populate the tail pages vmemmap pmd page.
 		 * Reuse the page that was populated in the prior iteration
 		 * with just tail struct pages.
 		 */
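To make that PMD path concrete, here is a small standalone sketch (not kernel
code; assuming x86_64 where a section spans 128M, i.e.
PFN_PHYS(PAGES_PER_SECTION) == 128M) of which reuse path each of the eight
sections of a 1G compound page would take once compound_section_tail_huge_page()
is in place:

#include <stdio.h>

/*
 * Standalone model (not kernel code) of the decision above for a 1G
 * geometry with 128M sections: the first section populates head + tail
 * vmemmap pages, the second reuses the tail page at PTE granularity, and
 * the remaining sections can reuse the whole tail-page PMD (see the layout
 * diagram in Documentation/vm/vmemmap_dedup.rst).
 */
#define SZ_128M			(128UL << 20)
#define SZ_1G			(1UL << 30)
#define SECTION_BYTES		SZ_128M	/* PFN_PHYS(PAGES_PER_SECTION) on x86_64 */

int main(void)
{
	unsigned long offset;

	for (offset = 0; offset < SZ_1G; offset += SECTION_BYTES) {
		const char *how;

		if ((offset % SZ_1G) == 0)
			how = "populate head + tail vmemmap pages";
		else if ((offset % SZ_1G) > SECTION_BYTES)
			how = "reuse tail vmemmap PMD";
		else
			how = "reuse tail vmemmap page (PTE)";

		printf("section %lu (offset %4luM): %s\n",
		       offset / SECTION_BYTES, offset >> 20, how);
	}
	return 0;
}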