On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>
> Currently, for compound PUD mappings, the implementation consumes 40MB
> per TB but it can be optimized to 16MB per TB with the approach
> detailed below.
>
> Right now base pages are used to populate the PUD tail pages, and it
> picks the address of the previous page of the subsection that precedes
> the memmap being initialized. This is done when a given memmap
> address isn't aligned to the pgmap @geometry (which is safe to do because
> @ranges are guaranteed to be aligned to @geometry).
>
> For pagemaps with an alignment which spans several sections, this means
> that PMD pages are unnecessarily allocated for reusing the same tail
> pages. Effectively, on x86 a PUD can span 8 sections (depending on
> config), and a page is allocated for the PMD to reuse the tail vmemmap
> across the rest of the PTEs. In short, the tail vmemmap areas covered
> by those PMDs all effectively contain the same PFN. So instead of doing
> it this way, populate a new PMD on the second section of the compound
> page (tail vmemmap PMD), and then have the following sections utilize
> the previously populated PMD, which only contains tail pages.
>
> With this scheme, for a 1GB pagemap-aligned area, the first PMD
> (section) would contain the head page and 32767 tail pages, while the
> second PMD contains the full 32768 tail pages. The latter page gets
> its PMD reused across future section mappings of the same pagemap.
>
> Besides allocating fewer page table entries and keeping parity with
> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
> this further increases savings per compound page. Rather than
> requiring 8 PMD page allocations, only 2 are needed (plus two base
> pages allocated for the head and tail areas of the first PMD). 2M
> pages still require using base pages, though.

This looks good to me now, modulo the tail_page helper discussed
previously. Thanks for the diagram; it makes it clearer what's
happening. I don't see any red flags that would prevent a reviewed-by
when you send the next spin.

>
> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
> ---
>  Documentation/vm/vmemmap_dedup.rst | 109 +++++++++++++++++++++++++++++
>  include/linux/mm.h                 |   3 +-
>  mm/sparse-vmemmap.c                |  74 +++++++++++++++++---
>  3 files changed, 174 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
> index 42830a667c2a..96d9f5f0a497 100644
> --- a/Documentation/vm/vmemmap_dedup.rst
> +++ b/Documentation/vm/vmemmap_dedup.rst
> @@ -189,3 +189,112 @@ at a later stage when we populate the sections.
>  It only use 3 page structs for storing all information as opposed
>  to 4 on HugeTLB pages. This does not affect memory savings between both.
>
> +Additionally, it further extends the tail page deduplication to 1GB
> +device-dax compound pages.
> +
> +E.g.: A 1G device-dax page on x86_64 consists of 4096 page frames, split
> +across 8 PMD page frames, with the first PMD having 2 PTE page frames.
> +In total this represents 40960 bytes per 1GB page.
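As a sanity check on those numbers, here is a quick userspace sketch of
the arithmetic (my own toy model, not code from this series; it assumes
4K base pages, a 64-byte struct page, and the "8 PMD page frames plus
2 PTE page frames" accounting described above):

#include <stdio.h>

int main(void)
{
        const long page_size = 4096;        /* assumed 4K base pages */
        const long struct_page_size = 64;   /* assumed sizeof(struct page) */
        const long gig = 1L << 30;

        long nr_struct_pages = gig / page_size;                               /* 262144 */
        long vmemmap_pages = nr_struct_pages * struct_page_size / page_size;  /* 4096 */
        long nr_pmds = vmemmap_pages * page_size / (2L << 20);                /* 8 */

        /* Before PMD reuse: 8 PMD page frames + 2 PTE page frames (head/tail). */
        long before = (nr_pmds + 2) * page_size;
        /* After PMD reuse: 2 PMD page frames + the same 2 PTE page frames. */
        long after = (2 + 2) * page_size;

        printf("%ld vmemmap pages, %ld PMDs per 1G page\n", vmemmap_pages, nr_pmds);
        printf("per 1G: %ld -> %ld bytes\n", before, after);
        printf("per 1T: %ld -> %ld MB\n", before * 1024 >> 20, after * 1024 >> 20);
        return 0;
}

That prints 40960 -> 16384 bytes per 1G and 40 -> 16 MB per 1T, which
matches the changelog.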
> +
> +Here is how things look after the previously described tail page deduplication
> +technique.
> +
> + device-dax       page frames      struct pages(4096 pages)     page frame(2 pages)
> + +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
> + |           |    |    0     |     |     0     | -------------> |      0      |
> + |           |    +----------+     +-----------+                +-------------+
> + |           |                      |     1     | -------------> |      1      |
> + |           |                      +-----------+                +-------------+
> + |           |                      |     2     | ----------------^ ^ ^ ^ ^ ^ ^
> + |           |                      +-----------+                   | | | | | |
> + |           |                      |     3     | ------------------+ | | | | |
> + |           |                      +-----------+                     | | | | |
> + |           |                      |     4     | --------------------+ | | | |
> + |   PMD 0   |                      +-----------+                       | | | |
> + |           |                      |     5     | ----------------------+ | | |
> + |           |                      +-----------+                         | | |
> + |           |                      |    ..     | ------------------------+ | |
> + |           |                      +-----------+                           | |
> + |           |                      |    511    | --------------------------+ |
> + |           |                      +-----------+                             |
> + |           |                                                                 |
> + |           |                                                                 |
> + |           |                                                                 |
> + +-----------+     page frames                                                 |
> + +-----------+ -> +----------+ --> +-----------+   mapping to                 |
> + |           |    |  1 .. 7  |     |    512    | ----------------------------+
> + |           |    +----------+     +-----------+                             |
> + |           |                      |    ..     | ----------------------------+
> + |           |                      +-----------+                             |
> + |           |                      |    ..     | ----------------------------+
> + |           |                      +-----------+                             |
> + |           |                      |    ..     | ----------------------------+
> + |           |                      +-----------+                             |
> + |           |                      |    ..     | ----------------------------+
> + |    PMD    |                      +-----------+                             |
> + |  1 .. 7   |                      |    ..     | ----------------------------+
> + |           |                      +-----------+                             |
> + |           |                      |    ..     | ----------------------------+
> + |           |                      +-----------+                             |
> + |           |                      |   4095    | ----------------------------+
> + +-----------+                      +-----------+
> +
> +Page frames of PMDs 1 through 7 are allocated and mapped to the same PTE page
> +frame that stores tail pages. As we can see in the diagram, PMDs 1 through 7
> +all look the same. Therefore we can map PMDs 2 through 7 to the PMD 1 page
> +frame. This allows freeing 6 vmemmap pages per 1GB page, decreasing the
> +overhead per 1GB page from 40960 bytes to 16384 bytes.
> +
> +Here is how things look after PMD tail page deduplication.
> +
> + device-dax       page frames      struct pages(4096 pages)     page frame(2 pages)
> + +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
> + |           |    |    0     |     |     0     | -------------> |      0      |
> + |           |    +----------+     +-----------+                +-------------+
> + |           |                      |     1     | -------------> |      1      |
> + |           |                      +-----------+                +-------------+
> + |           |                      |     2     | ----------------^ ^ ^ ^ ^ ^ ^
> + |           |                      +-----------+                   | | | | | |
> + |           |                      |     3     | ------------------+ | | | | |
> + |           |                      +-----------+                     | | | | |
> + |           |                      |     4     | --------------------+ | | | |
> + |   PMD 0   |                      +-----------+                       | | | |
> + |           |                      |     5     | ----------------------+ | | |
> + |           |                      +-----------+                         | | |
> + |           |                      |    ..     | ------------------------+ | |
> + |           |                      +-----------+                           | |
> + |           |                      |    511    | --------------------------+ |
> + |           |                      +-----------+                             |
> + |           |                                                                 |
> + |           |                                                                 |
> + |           |                                                                 |
> + +-----------+     page frames                                                 |
> + +-----------+ -> +----------+ --> +-----------+   mapping to                 |
> + |           |    |    1     |     |    512    | ----------------------------+
> + |           |    +----------+     +-----------+                             |
> + |           |     ^ ^ ^ ^ ^ ^     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |   PMD 1   |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |   4095    | ----------------------------+
> + +-----------+     | | | | | |     +-----------+
> + |   PMD 2   | ----+ | | | | |
> + +-----------+       | | | | |
> + |   PMD 3   | ------+ | | | |
> + +-----------+         | | | |
> + |   PMD 4   | --------+ | | |
> + +-----------+           | | |
> + |   PMD 5   | ----------+ | |
> + +-----------+             | |
> + |   PMD 6   | ------------+ |
> + +-----------+               |
> + |   PMD 7   | --------------+
> + +-----------+
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5e3e153ddd3d..e9dc3e2de7be 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3088,7 +3088,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
> -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
> +pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
> +                            struct page *block);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>                              struct vmem_altmap *altmap, struct page *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index a8de6c472999..68041ca9a797 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -537,13 +537,22 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>          return p;
>  }
>
> -pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
> +pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
> +                                       struct page *block)
>  {
>          pmd_t *pmd = pmd_offset(pud, addr);
>          if (pmd_none(*pmd)) {
> -                void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> -                if (!p)
> -                        return NULL;
> +                void *p;
> +
> +                if (!block) {
> +                        p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> +                        if (!p)
> +                                return NULL;
> +                } else {
> +                        /* See comment in vmemmap_pte_populate(). */
> +                        get_page(block);
> +                        p = page_to_virt(block);
> +                }
>                  pmd_populate_kernel(&init_mm, pmd, p);
>          }
>          return pmd;
>  }
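Unrelated nit, feel free to ignore: the diagrams made me double check the
bookkeeping, so here is a tiny userspace model of the final layout as I
read the two pictures (my own sketch, nothing from this patch):

#include <stdio.h>

int main(void)
{
        int slot, head = 0, tail = 0, own_pmd = 0, reused_pmd = 0;

        /* 4096 vmemmap page slots for one 1G page, 512 slots per PMD. */
        for (slot = 0; slot < 4096; slot++) {
                int pmd = slot / 512;
                int data_frame = (slot == 0) ? 0 : 1;   /* head frame or shared tail frame */
                int pmd_page = (pmd <= 1) ? pmd : 1;    /* PMDs 2..7 borrow PMD 1's page */

                if (data_frame == 0)
                        head++;
                else
                        tail++;

                if (slot % 512 == 0) {  /* count each PMD once */
                        if (pmd_page == pmd)
                                own_pmd++;
                        else
                                reused_pmd++;
                }
        }

        printf("slots on head frame: %d, on tail frame: %d\n", head, tail);
        printf("PMDs with their own page: %d, reusing PMD 1's: %d\n",
               own_pmd, reused_pmd);
        return 0;
}

That prints 1/4095 and 2/6, i.e. the 6 freed pages the text mentions.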
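The allocate-vs-reuse split above reads cleanly to me. For my own
understanding I modelled the control flow of this hunk in userspace
roughly as below (toy code with made-up names, only meant to mirror the
branch: either allocate a fresh zeroed block for the PMD, or take an
extra reference on the caller-supplied block and map that instead):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a vmemmap backing block with a reference count. */
struct block {
        int refcount;
        char data[4096];
};

/* Mirrors vmemmap_pmd_populate() above: no reuse block -> allocate a
 * new zeroed block with one reference; reuse block -> bump its
 * refcount and hand back the same block.
 */
static struct block *populate_pmd_block(struct block *reuse)
{
        struct block *p;

        if (!reuse) {
                p = malloc(sizeof(*p));
                if (!p)
                        return NULL;
                memset(p->data, 0, sizeof(p->data));
                p->refcount = 1;
        } else {
                reuse->refcount++;      /* analogous to get_page(block) */
                p = reuse;
        }
        return p;
}

int main(void)
{
        struct block *tail = populate_pmd_block(NULL);  /* PMD 1: fresh block */
        int i;

        /* PMDs 2..7 of a 1G page all reuse PMD 1's block. */
        for (i = 2; i <= 7; i++)
                populate_pmd_block(tail);

        printf("tail block refcount: %d\n", tail->refcount);   /* 7 */
        return 0;
}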
> @@ -585,15 +594,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>          return pgd;
>  }
>
> -static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                              struct vmem_altmap *altmap,
> -                                              struct page *reuse, struct page **page)
> +static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
> +                                                  struct vmem_altmap *altmap,
> +                                                  struct page *reuse, pmd_t **ptr)
>  {
>          pgd_t *pgd;
>          p4d_t *p4d;
>          pud_t *pud;
>          pmd_t *pmd;
> -        pte_t *pte;
>
>          pgd = vmemmap_pgd_populate(addr, node);
>          if (!pgd)
> @@ -604,9 +612,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>          pud = vmemmap_pud_populate(p4d, addr, node);
>          if (!pud)
>                  return -ENOMEM;
> -        pmd = vmemmap_pmd_populate(pud, addr, node);
> +        pmd = vmemmap_pmd_populate(pud, addr, node, reuse);
>          if (!pmd)
>                  return -ENOMEM;
> +        if (ptr)
> +                *ptr = pmd;
> +        return 0;
> +}
> +
> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> +                                              struct vmem_altmap *altmap,
> +                                              struct page *reuse, struct page **page)
> +{
> +        pmd_t *pmd;
> +        pte_t *pte;
> +
> +        if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
> +                return -ENOMEM;
> +
>          pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
>          if (!pte)
>                  return -ENOMEM;
> @@ -650,6 +673,20 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>          return vmemmap_populate_address(addr, node, NULL, NULL, page);
>  }
>
> +static int __meminit vmemmap_populate_pmd_range(unsigned long start,
> +                                                unsigned long end,
> +                                                int node, struct page *page)
> +{
> +        unsigned long addr = start;
> +
> +        for (; addr < end; addr += PMD_SIZE) {
> +                if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
> +                        return -ENOMEM;
> +        }
> +
> +        return 0;
> +}
> +
>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>                                                       unsigned long start,
>                                                       unsigned long end, int node,
> @@ -670,6 +707,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>          offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>          if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
>              pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
> +                pmd_t *pmdp;
>                  pte_t *ptep;
>
>                  addr = start - PAGE_SIZE;
> @@ -681,11 +719,25 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>                   * the previous struct pages are mapped when trying to lookup
>                   * the last tail page.
>                   */
> -                ptep = pte_offset_kernel(pmd_off_k(addr), addr);
> -                if (!ptep)
> +                pmdp = pmd_off_k(addr);
> +                if (!pmdp)
> +                        return -ENOMEM;
> +
> +                /*
> +                 * Reuse the tail pages vmemmap pmd page
> +                 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
> +                 */
> +                if (offset % pgmap_geometry(pgmap) > PFN_PHYS(PAGES_PER_SECTION))
> +                        return vmemmap_populate_pmd_range(start, end, node,
> +                                                          pmd_page(*pmdp));
> +
> +                /* See comment above when pmd_off_k() is called. */
> +                ptep = pte_offset_kernel(pmdp, addr);
> +                if (pte_none(*ptep))
>                          return -ENOMEM;
>
>                  /*
> +                 * Populate the tail pages vmemmap pmd page.
>                   * Reuse the page that was populated in the prior iteration
>                   * with just tail struct pages.
>                   */
> --
> 2.17.1
>
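One last note, mostly to convince myself that the offset check does what
the changelog says: for a 1G geometry on x86_64 (128M sections) the
first section takes the full populate path, the second takes the
existing PTE-reuse path, and sections 3..8 take the new PMD-reuse path.
A quick userspace walk of the 8 sections (my own model of the condition
above, constants hard-coded; not the kernel code):

#include <stdio.h>

int main(void)
{
        const unsigned long geometry = 1UL << 30;       /* 1G compound page */
        const unsigned long section  = 1UL << 27;       /* 128M section on x86_64 */
        unsigned long offset;
        int s;

        for (s = 0; s < 8; s++) {
                offset = s * section;   /* offset of this section into the range */

                if (offset % geometry == 0)
                        printf("section %d: populate head+tail PTE pages\n", s);
                else if (offset % geometry > section)
                        printf("section %d: reuse tail vmemmap PMD page\n", s);
                else
                        printf("section %d: populate tail PMD, reuse tail PTE page\n", s);
        }
        return 0;
}

That lines up with both the diagram and the "8 PMD allocations -> 2"
claim in the changelog.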