At 11/20/2012 11:47 AM, Jaegeuk Hanse Wrote:
> On 11/19/2012 05:11 PM, Jiang Liu wrote:
>> On 2012-11-19 17:12, Wen Congyang wrote:
>>> At 11/19/2012 04:45 PM, Bob Liu Wrote:
>>>> zone->present_pages should be:
>>>>   spanned pages - absent pages - bootmem pages (including memmap pages),
>>>> but now it is:
>>>>   spanned pages - absent pages - memmap pages.
>>>> And it doesn't consider whether the memmap pages are actually
>>>> allocated from the zone or not, which can cause problems with the
>>>> recent memory hotplug improvements.
>>>>
>>>> For example:
>>>> NUMA node 1 has ZONE_NORMAL and ZONE_MOVABLE, and its memmap and
>>>> other bootmem are allocated from ZONE_MOVABLE.
>>>> So ZONE_NORMAL's present_pages should be spanned pages - absent
>>>> pages, but currently it also subtracts the memmap pages, which are
>>>> actually allocated from ZONE_MOVABLE.
>>>> This is wrong, and when offlining all memory of this zone,
>>>> (zone->present_pages -= offline_pages) goes below 0.
>>>> Since present_pages is an unsigned long, the result is actually a
>>>> very large integer, which makes zone->watermark[WMARK_MIN] a very
>>>> large integer too (see setup_per_zone_wmarks()).
>>>> As a result, totalreserve_pages also becomes a huge integer, and
>>>> memory allocation eventually fails in __vm_enough_memory().
>>>>
>>>> Related discussion:
>>>> http://lkml.org/lkml/2012/11/5/866
>>>> https://patchwork.kernel.org/patch/1346751/
>>>>
>>>> Related patches in mmotm:
>>>> mm: fix-up zone present pages (7f1290f2f2a4d2c) (sometimes causes a
>>>> regression)
>>>> mm: fix a regression with HIGHMEM (fe2cebd5a259eec) (Andrew has some
>>>> feedback)
>>>>
>>>> Jiang Liu has sent a series of patches fixing this issue by adding a
>>>> managed_pages field to struct zone:
>>>> [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
>>>>
>>>> But I think that approach is too complicated.
>>>> Mine is based on the two related patches already in mmotm (they need
>>>> to be reverted first).
>>>> It fixes the calculation of zone->present_pages by:
>>>> 1. Resetting zone->present_pages to zero before free_all_bootmem(),
>>>>    free_all_bootmem_node() and free_low_memory_core_early().
>>>>    I think these already cover all paths on all arches.
>>>>
>>>> 2. Whenever a page is freed to the buddy system in
>>>>    __free_pages_bootmem(), increasing zone->present_pages accordingly.
>>>>
>>>> Note this patch assumes that bootmem won't use memory above
>>>> ZONE_HIGHMEM, so
>>> Hmm, on an x86_64 box, bootmem uses memory in ZONE_MOVABLE, and
>>> ZONE_MOVABLE > ZONE_HIGHMEM.
>> That's an issue; I'm trying to avoid allocating bootmem from
>> ZONE_MOVABLE.
>> And is_highmem(z) or is_highmem_idx(zoneidx) could safely be used to
>> distinguish movable highmem zones.
> Hi Jiang,
>
> - I'm not sure why the bootmem mentioned above uses memory in
>   ZONE_MOVABLE. IIUC, it is because the current nobootmem/memblock
>   logic allocates pages from the highest available memory down to the
>   lowest, to avoid fragmentation. Is that correct?
> - Why do we need to avoid allocating bootmem from ZONE_MOVABLE?

ZONE_MOVABLE means that the memory can be moved to another place, so we
can offline it. That is very useful for memory hotplug.

Thanks
Wen Congyang

>
> Regards,
> Jaegeuk
>>
>>>> only zones below ZONE_HIGHMEM are reset/fixed. If that assumption
>>>> doesn't hold, some update is needed.
>>>> For ZONE_HIGHMEM, only fix its initial value to
>>>> spanned_pages - absent_pages in free_area_init_core().
>>>>
>>>> Only did some simple testing so far.
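
For illustration of the wraparound described above (a minimal userspace
C sketch, not part of the patch; the numbers are made up):

    #include <stdio.h>

    int main(void)
    {
            /* present_pages is under-counted, e.g. because memmap pages
             * that really live in ZONE_MOVABLE were subtracted from it */
            unsigned long present_pages = 1000;
            unsigned long offline_pages = 1024; /* pages being offlined */

            /* unsigned arithmetic wraps instead of going negative */
            present_pages -= offline_pages;
            printf("present_pages = %lu\n", present_pages);
            /* prints 18446744073709551592 on 64-bit; watermarks computed
             * from such a value explode and __vm_enough_memory() fails */
            return 0;
    }
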
>>>>
>>>> Signed-off-by: Jianguo Wu <wujianguo@xxxxxxxxxx>
>>>> Signed-off-by: Jiang Liu <jiang.liu@xxxxxxxxxx>
>>>> Signed-off-by: Bob Liu <lliubbo@xxxxxxxxx>
>>>> ---
>>>>  include/linux/mm.h |  3 +++
>>>>  mm/bootmem.c       |  2 ++
>>>>  mm/nobootmem.c     |  1 +
>>>>  mm/page_alloc.c    | 49 +++++++++++++++++++++++++++++++++++++------------
>>>>  4 files changed, 43 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 7b03cab..3b40eb6 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -1763,5 +1763,8 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
>>>>  static inline bool page_is_guard(struct page *page) { return false; }
>>>>  #endif /* CONFIG_DEBUG_PAGEALLOC */
>>>> +extern void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat);
>>>> +extern void reset_lowmem_zone_present_pages(void);
>>>> +
>>>>  #endif /* __KERNEL__ */
>>>>  #endif /* _LINUX_MM_H */
>>>> diff --git a/mm/bootmem.c b/mm/bootmem.c
>>>> index 26d057a..661775b 100644
>>>> --- a/mm/bootmem.c
>>>> +++ b/mm/bootmem.c
>>>> @@ -238,6 +238,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
>>>>  unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
>>>>  {
>>>>  	register_page_bootmem_info_node(pgdat);
>>>> +	reset_lowmem_zone_present_pages_pernode(pgdat);
>>>>  	return free_all_bootmem_core(pgdat->bdata);
>>>>  }
>>>> @@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
>>>>  	unsigned long total_pages = 0;
>>>>  	bootmem_data_t *bdata;
>>>>
>>>> +	reset_lowmem_zone_present_pages();
>>>>  	list_for_each_entry(bdata, &bdata_list, list)
>>>>  		total_pages += free_all_bootmem_core(bdata);
>>>> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
>>>> index bd82f6b..378d50a 100644
>>>> --- a/mm/nobootmem.c
>>>> +++ b/mm/nobootmem.c
>>>> @@ -126,6 +126,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
>>>>  	phys_addr_t start, end, size;
>>>>  	u64 i;
>>>>
>>>> +	reset_lowmem_zone_present_pages();
>>>>  	for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>>>>  		count += __free_memory_core(start, end);
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 07425a7..76d37f0 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>>>  {
>>>>  	unsigned int nr_pages = 1 << order;
>>>>  	unsigned int loop;
>>>> +	struct zone *zone;
>>>>
>>>>  	prefetchw(page);
>>>>  	for (loop = 0; loop < nr_pages; loop++) {
>>>> @@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>>>  	set_page_refcounted(page);
>>>>  	__free_pages(page, order);
>>>> +	zone = page_zone(page);
>>>> +	WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
>>>> +	zone->present_pages += nr_pages;
>>>>  }
>>>>
>>>>  #ifdef CONFIG_CMA
>>>> @@ -4547,18 +4551,20 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>>>>  		 * is used by this zone for memmap. This affects the watermark
>>>>  		 * and per-cpu initialisations
>>>>  		 */
>>>> -		memmap_pages =
>>>> -			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>>>> -		if (realsize >= memmap_pages) {
>>>> -			realsize -= memmap_pages;
>>>> -			if (memmap_pages)
>>>> -				printk(KERN_DEBUG
>>>> -				       " %s zone: %lu pages used for memmap\n",
>>>> -				       zone_names[j], memmap_pages);
>>>> -		} else
>>>> -			printk(KERN_WARNING
>>>> -				" %s zone: %lu pages exceeds realsize %lu\n",
>>>> -				zone_names[j], memmap_pages, realsize);
>>>> +		if (j < ZONE_HIGHMEM) {
>>>> +			memmap_pages =
>>>> +				PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>>>> +			if (realsize >= memmap_pages) {
>>>> +				realsize -= memmap_pages;
>>>> +				if (memmap_pages)
>>>> +					printk(KERN_DEBUG
>>>> +					       " %s zone: %lu pages used for memmap\n",
>>>> +					       zone_names[j], memmap_pages);
>>>> +			} else
>>>> +				printk(KERN_WARNING
>>>> +					" %s zone: %lu pages exceeds realsize %lu\n",
>>>> +					zone_names[j], memmap_pages, realsize);
>>>> +		}
>>>>
>>>>  	/* Account for reserved pages */
>>>>  	if (j == 0 && realsize > dma_reserve) {
>>>> @@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
>>>>  	dump_page_flags(page->flags);
>>>>  	mem_cgroup_print_bad_page(page);
>>>>  }
>>>> +
>>>> +/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
>>>> +void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
>>>> +{
>>>> +	int i;
>>>> +	struct zone *z;
>>>> +	for (i = 0; i < ZONE_HIGHMEM; i++) {
>>> And if CONFIG_HIGHMEM is n, ZONE_NORMAL == ZONE_HIGHMEM.
>>>
>>> So you don't reset ZONE_NORMAL here.
>>>
>>>> +		z = pgdat->node_zones + i;
>>>> +		z->present_pages = 0;
>>>> +	}
>>>> +}
>>>> +
>>>> +void reset_lowmem_zone_present_pages(void)
>>>> +{
>>>> +	int nid;
>>>> +
>>>> +	for_each_node_state(nid, N_HIGH_MEMORY)
>>>> +		reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
>>>> +}
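
Regarding the CONFIG_HIGHMEM=n point raised above: one possible fix,
along the lines of Jiang Liu's is_highmem_idx() suggestion earlier in
the thread, is to walk every zone of the node and skip only highmem and
movable zones, so ZONE_NORMAL is always reset. A sketch only, not code
from this thread:

    /* Reset present_pages for every zone bootmem may allocate from,
     * i.e. everything except highmem and ZONE_MOVABLE.  Unlike the
     * "i < ZONE_HIGHMEM" loop above, this still covers ZONE_NORMAL
     * when CONFIG_HIGHMEM is n. */
    void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
    {
            int i;

            for (i = 0; i < MAX_NR_ZONES; i++) {
                    if (is_highmem_idx(i) || i == ZONE_MOVABLE)
                            continue;
                    pgdat->node_zones[i].present_pages = 0;
            }
    }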