On Tue, 17 Jun 2014 02:32:51 +0000
Atsushi Kumagai <kumagai-atsushi at mxc.nes.nec.co.jp> wrote:

> Hello,
>
> This is the v2 patch for hugepage filtering, rebased on Petr's
> "Generic multi-page exclusion". Thanks, Petr.
>
> The current kernel's VMCOREINFO doesn't include the values necessary
> for this patch, so we need "-x vmlinux" to enable hugepage filtering.
> I tested this patch on kernel 3.13.
>
> Regarding this, Petr made an effort to add the values to VMCOREINFO,
> but the discussion seems to have stalled:
>
>   https://lkml.org/lkml/2014/4/11/349

Actually, I received an Acked-by from Vivek last Wednesday.

Oh, wait a moment, this email went to Andrew Morton, but not to any
mailing list. :-(

> So, we should resume that discussion for this patch. Then I should
> modify this patch to use PG_head_mask if it's accepted.

I have already experimented with hugepage filtering, but haven't sent
my patches yet, precisely because they depend on a not-yet-confirmed
feature in the kernel.

Anyway, let's take your patch as the base. I'll add my comments where
I believe my approach was better/cleaner.

> Thanks
> Atsushi Kumagai
>
>
> From: Atsushi Kumagai <kumagai-atsushi at mxc.nes.nec.co.jp>
> Date: Tue, 17 Jun 2014 08:59:44 +0900
> Subject: [PATCH v2] Exclude unnecessary hugepages.
>
> There are 2 types of hugepages in the kernel; both should be
> excluded as user pages.
>
> 1. Transparent huge pages (THP)
>    All these pages are anonymous pages (at least for now), so we
>    just need to know how many pages the hugepage consists of. This
>    can be read from page->lru.prev of the second page in the
>    hugepage.
>
> 2. Hugetlbfs pages
>    These pages aren't anonymous pages, but they are still a kind of
>    user page, so we should exclude them as well. Luckily, it's
>    possible to detect them by looking at page->lru.next of the
>    second page in the hugepage. This idea came from the kernel's
>    PageHuge().

Good point! My patch didn't take care of hugetlbfs pages.

>    The number of pages can be obtained in the same way as for THP.
>
> Signed-off-by: Atsushi Kumagai <kumagai-atsushi at mxc.nes.nec.co.jp>
> ---
>  makedumpfile.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  makedumpfile.h |   8 ++++
>  2 files changed, 120 insertions(+), 5 deletions(-)
>
> diff --git a/makedumpfile.c b/makedumpfile.c
> index 34db997..d170c54 100644
> --- a/makedumpfile.c
> +++ b/makedumpfile.c
> @@ -92,6 +92,7 @@ do { \
>  } while (0)
>
>  static void setup_page_is_buddy(void);
> +static void setup_page_is_hugepage(void);
>
>  void
>  initialize_tables(void)
> @@ -328,6 +329,18 @@ update_mmap_range(off_t offset, int initial) {
>  }
>
>  static int
> +page_is_hugepage(unsigned long flags) {
> +	if (NUMBER(PG_head) != NOT_FOUND_NUMBER) {
> +		return isHead(flags);
> +	} else if (NUMBER(PG_tail) != NOT_FOUND_NUMBER) {
> +		return isTail(flags);
> +	} else if (NUMBER(PG_compound) != NOT_FOUND_NUMBER) {
> +		return isCompound(flags);
> +	}
> +	return 0;
> +}
> +
> +static int

Since it looks like we'll get the mask in VMCOREINFO, I'd rather use
the mask, and construct it from the PG_* flags if there's no
VMCOREINFO.

I add a long PG_head_mask to struct number_table and define this
macro:

	#define isCompoundHead(flags)	(!!((flags) & NUMBER(PG_head_mask)))

Then I initialize it in get_structure_info like this:

	PG_head = get_enum_number("PG_head");
	if (PG_head == FAILED_DWARFINFO) {
		PG_head = get_enum_number("PG_compound");
		if (PG_head == FAILED_DWARFINFO)
			return FALSE;
	}
	NUMBER(PG_head_mask) = 1L << PG_head;

with a fallback in get_value_for_old_linux:

	if (NUMBER(PG_head_mask) == NOT_FOUND_NUMBER)
		NUMBER(PG_head_mask) = 1L << PG_compound_ORIGINAL;

Also, I prefer to write PG_head_mask to the makedumpfile-generated
VMCOREINFO, so it will have the same fields as the kernel-generated
VMCOREINFO.
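Just to illustrate the effect (a sketch on top of the above, not the
final code): with the mask in place, the whole three-way test
collapses into a single check. Note that it intentionally matches
only compound head pages; in my variant the tail pages don't need a
test of their own, because they are excluded as part of the range
that starts at the head:

	static int
	page_is_hugepage(unsigned long flags) {
		/* true for compound head pages only; tails are covered
		 * by excluding the whole range behind the head */
		return isCompoundHead(flags);
	}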
>  is_mapped_with_mmap(off_t offset) {
>
>  	if (info->flag_usemmap == MMAP_ENABLE
> @@ -1180,6 +1193,7 @@ get_symbol_info(void)
>  	SYMBOL_INIT(vmemmap_list, "vmemmap_list");
>  	SYMBOL_INIT(mmu_psize_defs, "mmu_psize_defs");
>  	SYMBOL_INIT(mmu_vmemmap_psize, "mmu_vmemmap_psize");
> +	SYMBOL_INIT(free_huge_page, "free_huge_page");
>
>  	return TRUE;
>  }
> @@ -1288,11 +1302,19 @@ get_structure_info(void)
>
>  	ENUM_NUMBER_INIT(PG_lru, "PG_lru");
>  	ENUM_NUMBER_INIT(PG_private, "PG_private");
> +	ENUM_NUMBER_INIT(PG_head, "PG_head");
> +	ENUM_NUMBER_INIT(PG_tail, "PG_tail");
> +	ENUM_NUMBER_INIT(PG_compound, "PG_compound");
>  	ENUM_NUMBER_INIT(PG_swapcache, "PG_swapcache");
>  	ENUM_NUMBER_INIT(PG_buddy, "PG_buddy");
>  	ENUM_NUMBER_INIT(PG_slab, "PG_slab");
>  	ENUM_NUMBER_INIT(PG_hwpoison, "PG_hwpoison");
>
> +	if (NUMBER(PG_head) == NOT_FOUND_NUMBER &&
> +	    NUMBER(PG_compound) == NOT_FOUND_NUMBER)
> +		/* Pre-2.6.26 kernels did not have pageflags */
> +		NUMBER(PG_compound) = PG_compound_ORIGINAL;
> +
>  	ENUM_TYPE_SIZE_INIT(pageflags, "pageflags");
>
>  	TYPEDEF_SIZE_INIT(nodemask_t, "nodemask_t");
> @@ -1694,6 +1716,7 @@ write_vmcoreinfo_data(void)
>  	WRITE_SYMBOL("vmemmap_list", vmemmap_list);
>  	WRITE_SYMBOL("mmu_psize_defs", mmu_psize_defs);
>  	WRITE_SYMBOL("mmu_vmemmap_psize", mmu_vmemmap_psize);
> +	WRITE_SYMBOL("free_huge_page", free_huge_page);
>
>  	/*
>  	 * write the structure size of 1st kernel
> @@ -1783,6 +1806,9 @@ write_vmcoreinfo_data(void)
>
>  	WRITE_NUMBER("PG_lru", PG_lru);
>  	WRITE_NUMBER("PG_private", PG_private);
> +	WRITE_NUMBER("PG_head", PG_head);
> +	WRITE_NUMBER("PG_tail", PG_tail);
> +	WRITE_NUMBER("PG_compound", PG_compound);
>  	WRITE_NUMBER("PG_swapcache", PG_swapcache);
>  	WRITE_NUMBER("PG_buddy", PG_buddy);
>  	WRITE_NUMBER("PG_slab", PG_slab);
> @@ -2033,6 +2059,7 @@ read_vmcoreinfo(void)
>  	READ_SYMBOL("vmemmap_list", vmemmap_list);
>  	READ_SYMBOL("mmu_psize_defs", mmu_psize_defs);
>  	READ_SYMBOL("mmu_vmemmap_psize", mmu_vmemmap_psize);
> +	READ_SYMBOL("free_huge_page", free_huge_page);
>
>  	READ_STRUCTURE_SIZE("page", page);
>  	READ_STRUCTURE_SIZE("mem_section", mem_section);
> @@ -2109,6 +2136,9 @@ read_vmcoreinfo(void)
>
>  	READ_NUMBER("PG_lru", PG_lru);
>  	READ_NUMBER("PG_private", PG_private);
> +	READ_NUMBER("PG_head", PG_head);
> +	READ_NUMBER("PG_tail", PG_tail);
> +	READ_NUMBER("PG_compound", PG_compound);
>  	READ_NUMBER("PG_swapcache", PG_swapcache);
>  	READ_NUMBER("PG_slab", PG_slab);
>  	READ_NUMBER("PG_buddy", PG_buddy);
> @@ -3283,6 +3313,9 @@ out:
>  	if (!get_value_for_old_linux())
>  		return FALSE;
>
> +	/* Get page flags for compound pages */
> +	setup_page_is_hugepage();
> +
>  	/* use buddy identification of free pages whether cyclic or not */
>  	/* (this can reduce pages scan of 1TB memory from 60sec to 30sec) */
>  	if (info->dump_level & DL_EXCLUDE_FREE)
> @@ -4346,6 +4379,24 @@ out:
>  		"follow free lists instead of mem_map array.\n");
>  }
>
> +static void
> +setup_page_is_hugepage(void)
> +{
> +	if (NUMBER(PG_head) != NOT_FOUND_NUMBER) {
> +		if (NUMBER(PG_tail) == NOT_FOUND_NUMBER) {
> +			/*
> +			 * If PG_tail is not explicitly saved, then assume
> +			 * that it immediately follows PG_head.
> +			 */
> +			NUMBER(PG_tail) = NUMBER(PG_head) + 1;
> +		}
> +	} else if ((NUMBER(PG_compound) == NOT_FOUND_NUMBER)
> +		   && (info->dump_level & DL_EXCLUDE_USER_DATA)) {
> +		MSG("Compound page bit could not be determined: ");
> +		MSG("huge pages will NOT be filtered.\n");
> +	}
> +}
> +
>  /*
>   * If using a dumpfile in kdump-compressed format as a source file
>   * instead of /proc/vmcore, 1st-bitmap of a new dumpfile must be
> @@ -4660,8 +4711,9 @@ __exclude_unnecessary_pages(unsigned long mem_map,
>  	mdf_pfn_t pfn_read_start, pfn_read_end;
>  	unsigned char page_cache[SIZE(page) * PGMM_CACHED];
>  	unsigned char *pcache;
> -	unsigned int _count, _mapcount = 0;
> +	unsigned int _count, _mapcount = 0, compound_order = 0;
>  	unsigned long flags, mapping, private = 0;
> +	unsigned long hugetlb_dtor;
>
>  	/*
>  	 * If a multi-page exclusion is pending, do it first
> @@ -4727,6 +4779,27 @@ __exclude_unnecessary_pages(unsigned long mem_map,
>  		flags = ULONG(pcache + OFFSET(page.flags));
>  		_count = UINT(pcache + OFFSET(page._count));
>  		mapping = ULONG(pcache + OFFSET(page.mapping));
> +
> +		if (index_pg < PGMM_CACHED - 1) {
> +			compound_order = ULONG(pcache + SIZE(page) + OFFSET(page.lru)
> +					       + OFFSET(list_head.prev));
> +			hugetlb_dtor = ULONG(pcache + SIZE(page) + OFFSET(page.lru)
> +					     + OFFSET(list_head.next));
> +		} else if (pfn + 1 < pfn_end) {

AFAICS this clause is not needed. All compound pages are aligned to
their page order, e.g. the head page of an order-2 compound page is
aligned to a multiple of 4. Since the mem_map cache is aligned to
PGMM_CACHED, which is defined as 512 (a power of two), a compound
page cannot possibly start on the last PFN of the cache.

I even added a sanity check for the alignment:

	if (order && order < sizeof(unsigned long) * 8 &&
	    (pfn & ((1UL << order) - 1)) == 0)

Ok, the "order" above corresponds to your "compound_order"...
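To make the alignment argument concrete, here is a throwaway
user-space check (my illustration, not part of either patch) that
brute-forces the claim for small orders:

	#include <assert.h>
	#define PGMM_CACHED 512		/* must be a power of two */

	int main(void)
	{
		unsigned int order;
		unsigned long pfn;

		for (order = 1; order < 12; order++)
			for (pfn = 0; pfn < 8 * PGMM_CACHED; pfn++)
				/* a head page is aligned to its order... */
				if ((pfn & ((1UL << order) - 1)) == 0)
					/* ...so it never lands on the last cache slot */
					assert(pfn % PGMM_CACHED != PGMM_CACHED - 1);
		return 0;
	}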
> +			unsigned char page_cache_next[SIZE(page)];
> +			if (!readmem(VADDR, mem_map, page_cache_next, SIZE(page))) {
> +				ERRMSG("Can't read the buffer of struct page.\n");
> +				return FALSE;
> +			}
> +			compound_order = ULONG(page_cache_next + OFFSET(page.lru)
> +					       + OFFSET(list_head.prev));
> +			hugetlb_dtor = ULONG(page_cache_next + OFFSET(page.lru)
> +					     + OFFSET(list_head.next));
> +		} else {
> +			compound_order = 0;
> +			hugetlb_dtor = 0;
> +		}
> +
>  		if (OFFSET(page._mapcount) != NOT_FOUND_STRUCTURE)
>  			_mapcount = UINT(pcache + OFFSET(page._mapcount));
>  		if (OFFSET(page.private) != NOT_FOUND_STRUCTURE)
> @@ -4754,6 +4827,10 @@ __exclude_unnecessary_pages(unsigned long mem_map,
>  		    && !isPrivate(flags) && !isAnon(mapping)) {
>  			if (clear_bit_on_2nd_bitmap_for_kernel(pfn, cycle))
>  				pfn_cache++;
> +			/*
> +			 * NOTE: If THP for cache is introduced, the check for
> +			 * compound pages is needed here.
> +			 */

I do this differently. I added:

	mdf_pfn_t *pfn_counter;

Then I set pfn_counter to the appropriate counter, but do not call
clear_bit_on_2nd_bitmap_for_kernel(). At the end of the long
if/else-if chain I add a final else clause:

		/*
		 * Page not excluded
		 */
		else
			continue;

If execution gets here, the page is excluded, so I can do:

		if (nr_pages == 1) {
			if (clear_bit_on_2nd_bitmap_for_kernel(pfn, cycle))
				(*pfn_counter)++;
		} else {
			exclude_range(pfn_counter, pfn, pfn + nr_pages, cycle);
		}
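For completeness: nr_pages above would be derived from the compound
order read earlier, gated by the same sanity check I mentioned. A
sketch, not the exact code from my tree:

		unsigned long nr_pages = 1;

		if (compound_order && compound_order < sizeof(unsigned long) * 8 &&
		    (pfn & ((1UL << compound_order) - 1)) == 0)
			nr_pages = 1UL << compound_order;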
What do you think?

Petr Tesarik