On 7/28/21 9:03 PM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>>
>> Currently, for compound PUD mappings, the implementation consumes 40MB
>> per TB but it can be optimized to 16MB per TB with the approach
>> detailed below.
>>
>> Right now basepages are used to populate the PUD tail pages, and it
>> picks the address of the previous page of the subsection that precedes
>> the memmap being initialized. This is done when a given memmap
>> address isn't aligned to the pgmap @geometry (which is safe to do because
>> @ranges are guaranteed to be aligned to @geometry).
>>
>> For pagemaps with a geometry which spans multiple sections, this means
>> that PMD pages are unnecessarily allocated to reuse the same tail
>> pages. Effectively, on x86 a PUD can span 8 sections (depending on
>> config), and a page is allocated per PMD just to reuse the tail
>> vmemmap across the rest of the PTEs. In short, the PMDs covering the
>> tail vmemmap areas all contain the same PFN. So instead of doing it
>> this way, populate a new PMD on the second section of the compound
>> page (the tail vmemmap PMD), and have the following sections utilize
>> that previously populated PMD, which contains only tail pages.
>>
>> With this scheme, for a 1GB pagemap-aligned area, the first PMD
>> (section) contains the head page and 32767 tail pages, while the
>> second PMD contains the full 32768 tail pages. The latter gets its
>> PMD reused across future section mappings of the same pagemap.
>>
>> Besides allocating fewer pagetable entries and keeping parity with
>> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
>> this further increases savings per compound page. Rather than
>> requiring 8 PMD page allocations we only need 2 (plus two base pages
>> allocated for the head and tail areas of the first PMD). 2M pages still
>> require using base pages, though.
>
> This looks good to me now, modulo the tail_page helper discussed
> previously. Thanks for the diagram, makes it clearer what's happening.
>
> I don't see any red flags that would prevent a reviewed-by when you
> send the next spin.
>
Cool, thanks!

>>
>> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
>> ---
>>  Documentation/vm/vmemmap_dedup.rst | 109 +++++++++++++++++++++++++++++
>>  include/linux/mm.h                 |   3 +-
>>  mm/sparse-vmemmap.c                |  74 +++++++++++++++++---
>>  3 files changed, 174 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
>> index 42830a667c2a..96d9f5f0a497 100644
>> --- a/Documentation/vm/vmemmap_dedup.rst
>> +++ b/Documentation/vm/vmemmap_dedup.rst
>> @@ -189,3 +189,112 @@ at a later stage when we populate the sections.
>>  It only use 3 page structs for storing all information as opposed
>>  to 4 on HugeTLB pages. This does not affect memory savings between both.
>>
>> +Additionally, it further extends the tail page deduplication to 1GB
>> +device-dax compound pages.
>> +
>> +E.g.: A 1G device-dax page on x86_64 consists of 4096 page frames, split
>> +across 8 PMD page frames, with the first PMD having 2 PTE page frames.
>> +In total this represents 40960 bytes per 1GB page.
>> +
>> +Here is how things look after the previously described tail page deduplication
>> +technique.
>> +
>> + device-dax       page frames      struct pages(4096 pages)   page frame(2 pages)
>> + +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
>> + |           |    |    0     |     |     0     | -------------> |      0      |
>> + |           |    +----------+     +-----------+                +-------------+
>> + |           |                     |     1     | -------------> |      1      |
>> + |           |                     +-----------+                +-------------+
>> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
>> + |           |                     +-----------+                   | | | | | |
>> + |           |                     |     3     | ------------------+ | | | | |
>> + |           |                     +-----------+                     | | | | |
>> + |           |                     |     4     | --------------------+ | | | |
>> + |   PMD 0   |                     +-----------+                       | | | |
>> + |           |                     |     5     | ----------------------+ | | |
>> + |           |                     +-----------+                         | | |
>> + |           |                     |    ..     | ------------------------+ | |
>> + |           |                     +-----------+                           | |
>> + |           |                     |    511    | --------------------------+ |
>> + |           |                     +-----------+                             |
>> + |           |                                                               |
>> + |           |                                                               |
>> + |           |                                                               |
>> + +-----------+    page frames                                                |
>> + +-----------+ -> +----------+ --> +-----------+   mapping to                |
>> + |           |    |  1 .. 7  |     |    512    | ----------------------------+
>> + |           |    +----------+     +-----------+                             |
>> + |           |                     |    ..     | ----------------------------+
>> + |           |                     +-----------+                             |
>> + |           |                     |    ..     | ----------------------------+
>> + |           |                     +-----------+                             |
>> + |           |                     |    ..     | ----------------------------+
>> + |           |                     +-----------+                             |
>> + |           |                     |    ..     | ----------------------------+
>> + |    PMD    |                     +-----------+                             |
>> + |  1 .. 7   |                     |    ..     | ----------------------------+
>> + |           |                     +-----------+                             |
>> + |           |                     |    ..     | ----------------------------+
>> + |           |                     +-----------+                             |
>> + |           |                     |   4095    | ----------------------------+
>> + +-----------+                     +-----------+
>> +
>> +Page frames of PMD 1 through 7 are allocated and mapped to the same PTE page
>> +frame that stores tail pages. As we can see in the diagram, PMDs 1 through 7
>> +all look the same. Therefore we can map PMDs 2 through 7 to the page frame of
>> +PMD 1. This allows us to free 6 vmemmap pages per 1GB page, decreasing the
>> +overhead per 1GB page from 40960 bytes to 16384 bytes.
>> +
>> +Here is how things look after PMD tail page deduplication.
>> +
>> + device-dax       page frames      struct pages(4096 pages)   page frame(2 pages)
>> + +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
>> + |           |    |    0     |     |     0     | -------------> |      0      |
>> + |           |    +----------+     +-----------+                +-------------+
>> + |           |                     |     1     | -------------> |      1      |
>> + |           |                     +-----------+                +-------------+
>> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
>> + |           |                     +-----------+                   | | | | | |
>> + |           |                     |     3     | ------------------+ | | | | |
>> + |           |                     +-----------+                     | | | | |
>> + |           |                     |     4     | --------------------+ | | | |
>> + |   PMD 0   |                     +-----------+                       | | | |
>> + |           |                     |     5     | ----------------------+ | | |
>> + |           |                     +-----------+                         | | |
>> + |           |                     |    ..     | ------------------------+ | |
>> + |           |                     +-----------+                           | |
>> + |           |                     |    511    | --------------------------+ |
>> + |           |                     +-----------+                             |
>> + |           |                                                               |
>> + |           |                                                               |
>> + |           |                                                               |
>> + +-----------+    page frames                                                |
>> + +-----------+ -> +----------+ --> +-----------+   mapping to                |
>> + |           |    |    1     |     |    512    | ----------------------------+
>> + |           |    +----------+     +-----------+                             |
>> + |           |     ^ ^ ^ ^ ^ ^     |    ..     | ----------------------------+
>> + |           |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    ..     | ----------------------------+
>> + |           |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    ..     | ----------------------------+
>> + |           |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    ..     | ----------------------------+
>> + |   PMD 1   |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    ..     | ----------------------------+
>> + |           |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    ..     | ----------------------------+
>> + |           |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |   4095    | ----------------------------+
>> + +-----------+     | | | | | |     +-----------+
>> + |   PMD 2   | ----+ | | | | |
>> + +-----------+      | | | | |
>> + |   PMD 3   | ------+ | | | |
>> + +-----------+        | | | |
>> + |   PMD 4   | --------+ | | |
>> + +-----------+          | | |
>> + |   PMD 5   | ----------+ | |
>> + +-----------+            | |
>> + |   PMD 6   | ------------+ |
>> + +-----------+              |
>> + |   PMD 7   | --------------+
>> + +-----------+
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5e3e153ddd3d..e9dc3e2de7be 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3088,7 +3088,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
>>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>> -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>> +pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
>> +                            struct page *block);
>>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>>                              struct vmem_altmap *altmap, struct page *block);
>>  void *vmemmap_alloc_block(unsigned long size, int node);
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index a8de6c472999..68041ca9a797 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -537,13 +537,22 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>>          return p;
>>  }
>>
>> -pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
>> +pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
>> +                                       struct page *block)
>>  {
>>          pmd_t *pmd = pmd_offset(pud, addr);
>>          if (pmd_none(*pmd)) {
>> -                void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>> -                if (!p)
>> -                        return NULL;
>> +                void *p;
>> +
>> +                if (!block) {
>> +                        p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>> +                        if (!p)
>> +                                return NULL;
>> +                } else {
>> +                        /* See comment in vmemmap_pte_populate(). */
>> +                        get_page(block);
>> +                        p = page_to_virt(block);
>> +                }
>>                  pmd_populate_kernel(&init_mm, pmd, p);
>>          }
>>          return pmd;
>> @@ -585,15 +594,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>          return pgd;
>>  }
>>
>> -static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> -                                              struct vmem_altmap *altmap,
>> -                                              struct page *reuse, struct page **page)
>> +static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
>> +                                                  struct vmem_altmap *altmap,
>> +                                                  struct page *reuse, pmd_t **ptr)
>>  {
>>          pgd_t *pgd;
>>          p4d_t *p4d;
>>          pud_t *pud;
>>          pmd_t *pmd;
>> -        pte_t *pte;
>>
>>          pgd = vmemmap_pgd_populate(addr, node);
>>          if (!pgd)
>> @@ -604,9 +612,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>>          pud = vmemmap_pud_populate(p4d, addr, node);
>>          if (!pud)
>>                  return -ENOMEM;
>> -        pmd = vmemmap_pmd_populate(pud, addr, node);
>> +        pmd = vmemmap_pmd_populate(pud, addr, node, reuse);
>>          if (!pmd)
>>                  return -ENOMEM;
>> +        if (ptr)
>> +                *ptr = pmd;
>> +        return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> +                                              struct vmem_altmap *altmap,
>> +                                              struct page *reuse, struct page **page)
>> +{
>> +        pmd_t *pmd;
>> +        pte_t *pte;
>> +
>> +        if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
>> +                return -ENOMEM;
>> +
>>          pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
>>          if (!pte)
>>                  return -ENOMEM;
>> @@ -650,6 +673,20 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>>          return vmemmap_populate_address(addr, node, NULL, NULL, page);
>>  }
>>
>> +static int __meminit vmemmap_populate_pmd_range(unsigned long start,
>> +                                                unsigned long end,
>> +                                                int node, struct page *page)
>> +{
>> +        unsigned long addr = start;
>> +
>> +        for (; addr < end; addr += PMD_SIZE) {
>> +                if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
>> +                        return -ENOMEM;
>> +        }
>> +
>> +        return 0;
>> +}
>> +
>>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>>                                                       unsigned long start,
>>                                                       unsigned long end, int node,
>> @@ -670,6 +707,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>>          offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>>          if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
>>              pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
>> +                pmd_t *pmdp;
>>                  pte_t *ptep;
>>
>>                  addr = start - PAGE_SIZE;
>> @@ -681,11 +719,25 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>>                   * the previous struct pages are mapped when trying to lookup
>>                   * the last tail page.
>>                   */
>> -                ptep = pte_offset_kernel(pmd_off_k(addr), addr);
>> -                if (!ptep)
>> +                pmdp = pmd_off_k(addr);
>> +                if (!pmdp)
>> +                        return -ENOMEM;
>> +
>> +                /*
>> +                 * Reuse the tail pages vmemmap pmd page
>> +                 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
>> +                 */
>> +                if (offset % pgmap_geometry(pgmap) > PFN_PHYS(PAGES_PER_SECTION))
>> +                        return vmemmap_populate_pmd_range(start, end, node,
>> +                                                          pmd_page(*pmdp));
>> +
>> +                /* See comment above when pmd_off_k() is called. */
>> +                ptep = pte_offset_kernel(pmdp, addr);
>> +                if (pte_none(*ptep))
>>                          return -ENOMEM;
>>
>>                  /*
>> +                 * Populate the tail pages vmemmap pmd page.
>>                   * Reuse the page that was populated in the prior iteration
>>                   * with just tail struct pages.
>>                   */
>> --
>> 2.17.1
>>
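
As a cross-check of the numbers above, here is a minimal standalone sketch
(not part of the patch) that reproduces the overhead arithmetic from the
commit message. The constants are the usual x86_64 values and are assumed
here for illustration rather than taken from kernel headers:

/*
 * Per-1GB-page vmemmap overhead before and after the PMD tail page
 * reuse described above, assuming 4K base pages, a 64-byte
 * struct page, and 2M PMDs.
 */
#include <stdio.h>

#define PAGE_SIZE_B   4096UL                 /* base page size */
#define STRUCT_PAGE_B 64UL                   /* sizeof(struct page) */
#define PMD_SIZE_B    (512UL * PAGE_SIZE_B)  /* 2M */
#define GEOMETRY_B    (512UL * PMD_SIZE_B)   /* 1G compound page */

int main(void)
{
        /* 262144 struct pages -> 16M of vmemmap -> 8 PMDs */
        unsigned long vmemmap = (GEOMETRY_B / PAGE_SIZE_B) * STRUCT_PAGE_B;
        unsigned long pmds = vmemmap / PMD_SIZE_B;

        /* Before: 8 PMD (PTE table) pages + head and tail data pages */
        unsigned long before = (pmds + 2) * PAGE_SIZE_B;

        /* After: PMDs 2..7 reuse PMD 1's page frame, freeing 6 pages */
        unsigned long after = before - (pmds - 2) * PAGE_SIZE_B;

        printf("per 1GB page: %lu -> %lu bytes\n", before, after);
        printf("per 1TB:      %luMB -> %luMB\n",
               before * 1024 / (1024 * 1024),
               after * 1024 / (1024 * 1024));
        return 0;
}

This prints 40960 -> 16384 bytes per 1GB page and 40MB -> 16MB per TB,
matching the savings claimed in the commit message and in the
vmemmap_dedup.rst text.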