On Tue, Dec 02, 2014 at 04:46:33PM +0100, Vlastimil Babka wrote:
> On 12/02/2014 05:53 AM, Joonsoo Kim wrote:
> >On Tue, Dec 02, 2014 at 12:47:24PM +1100, Christian Marie wrote:
> >>On 28.11.2014 9:03, Joonsoo Kim wrote:
> >>>Hello,
> >>>
> >>>I didn't follow up on this discussion, but, at a glance, this excessive CPU
> >>>usage by compaction is related to the following fixes.
> >>>
> >>>Could you test the following two patches?
> >>>
> >>>If these fix your problem, I will resubmit the patches with a proper commit
> >>>description.
> >>>
> >>>-------- 8< ---------
> >>
> >>Thanks for looking into this. Running a 3.18-rc5 kernel with your patches has
> >>produced some interesting results.
> >>
> >>Load average still spikes to around 2000-3000, with the processors spinning at
> >>100% doing compaction-related work when min_free_kbytes is left at the default.
> >>
> >>However, unlike before, the system is now completely stable. Pre-patch it would
> >>be almost completely unresponsive (having to wait 30 seconds to establish an
> >>SSH connection and several seconds to send a character).
> >>
> >>Is it reasonable to guess that ipoib is giving compaction a hard time and that
> >>fixing this bug has allowed the system to at least not lock up?
> >>
> >>I will try back-porting this to 3.10 and seeing if it is stable under these
> >>strange conditions also.
> >
> >Hello,
> >
> >Good to hear!
>
> Indeed, although I somehow doubt your first patch could have made
> such a difference. It only matters when you have a whole pageblock
> free. Without the patch, the particular compaction attempt that
> managed to free the block might not be terminated ASAP, but then the
> free pageblock is still allocatable by the following allocation
> attempts, so it shouldn't result in a stream of complete
> compactions.

A high-order freepage made by compaction could be broken up by other
order-0 allocation attempts, so a following high-order allocation
attempt could result in new compaction. It would depend on the
workload. Anyway, we should fix cc->order to order. :)

> So I would expect it's either a fluke, or the second patch made the
> difference, to either SLUB or something else making such
> fallback-able allocations.
>
> But hmm, I've never considered the implications of
> compact_finished() migratetype handling on unmovable allocations.
> Regardless of cc->order, it often has to free a whole pageblock to
> succeed, as it's unlikely it will succeed compacting within a
> pageblock already marked as UNMOVABLE. Guess it's to prevent further
> fragmentation and that makes sense, but it does make high-order
> unmovable allocations problematic. At least the watermark checks for
> allowing compaction in the first place are then wrong - we decide
> that based on cc->order, but we in fact need at least a pageblock's
> worth of space free to actually succeed.

I think that the watermark check is okay, but we need an elegant way to
decide the best point at which compaction should be stopped.

I made the following two patches about this. They would make
non-movable compaction less aggressive. This is just a draft, so please
ignore my poor description. :) Could you comment on it?
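To make the criterion concrete before the patches themselves, here is a
rough, self-contained userspace sketch of the page allocator's
anti-fragmentation steal rule that both patches below build on. The
pageblock_order value of 9 and the standalone main() are illustrative
assumptions for the example only, not kernel code:

/* Userspace model of the anti-fragmentation steal rule.
 * Assumes pageblock_order = 9 (the common x86 value with 4K pages). */
#include <stdbool.h>
#include <stdio.h>

enum migratetype { UNMOVABLE, RECLAIMABLE, MOVABLE, CMA };

static const unsigned int pageblock_order = 9;
static const bool page_group_by_mobility_disabled = false;

/* Stealing from a fallback list is allowed only when it should not
 * make fragmentation worse. */
static bool can_steal(unsigned int order, int start_mt, int fallback_mt)
{
        if (fallback_mt == CMA)          /* CMA pages must stay on the CMA list */
                return false;
        if (order >= pageblock_order)    /* the whole pageblock changes owner */
                return true;
        if (order >= pageblock_order / 2 ||
            start_mt == RECLAIMABLE ||
            page_group_by_mobility_disabled)
                return true;
        return false;
}

int main(void)
{
        unsigned int order;

        /* With these assumptions, an unmovable request may steal from the
         * movable list only from order pageblock_order/2 (= 4) upward. */
        for (order = 0; order <= 9; order++)
                printf("order %u: unmovable request may steal from movable? %s\n",
                       order, can_steal(order, UNMOVABLE, MOVABLE) ? "yes" : "no");
        return 0;
}

The second patch makes compact_finished() stop compaction exactly at the
point where this rule would allow the pending allocation to succeed.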
--------->8-----------------
>From bd6b285c38fd94e5ec03a720bed4debae3914bde Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
Date: Mon, 1 Dec 2014 11:56:57 +0900
Subject: [PATCH 1/2] mm/page_alloc: expands broken freepage to proper buddy
 list when steal

There is odd behaviour when we steal freepages from another
migratetype's buddy list. In try_to_steal_freepages(), we move all
freepages in the pageblock that the found freepage belongs to over to
the requested migratetype, in order to mitigate fragmentation. If the
number of moved pages is enough to change the pageblock migratetype,
there is no problem. If it is not enough, we don't change the pageblock
migratetype, and the broken freepages are added to the original
migratetype's buddy list rather than to the requested migratetype's one.
To me this is odd, because we have already moved all freepages in this
pageblock to the requested migratetype. This patch fixes this situation
by adding the broken freepages to the requested migratetype's buddy
list in that case.

This patch also introduces a new function that helps decide whether we
can steal a page without causing further fragmentation. It will be used
by the following patch for the compaction finish criteria.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
---
 include/trace/events/kmem.h |    7 +++--
 mm/page_alloc.c             |   72 +++++++++++++++++++++++++------------------
 2 files changed, 46 insertions(+), 33 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index aece134..31cda40 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -268,11 +268,11 @@ TRACE_EVENT(mm_page_alloc_extfrag,

 	TP_PROTO(struct page *page,
 		int alloc_order, int fallback_order,
-		int alloc_migratetype, int fallback_migratetype, int new_migratetype),
+		int alloc_migratetype, int fallback_migratetype),

 	TP_ARGS(page,
 		alloc_order, fallback_order,
-		alloc_migratetype, fallback_migratetype, new_migratetype),
+		alloc_migratetype, fallback_migratetype),

 	TP_STRUCT__entry(
 		__field(	struct page *,	page			)
@@ -289,7 +289,8 @@ TRACE_EVENT(mm_page_alloc_extfrag,
 		__entry->fallback_order	= fallback_order;
 		__entry->alloc_migratetype	= alloc_migratetype;
 		__entry->fallback_migratetype	= fallback_migratetype;
-		__entry->change_ownership	= (new_migratetype == alloc_migratetype);
+		__entry->change_ownership	= (alloc_migratetype ==
+					get_pageblock_migratetype(page));
 	),

 	TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 291b737..77be830 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1139,44 +1139,50 @@ static void change_pageblock_range(struct page *pageblock_page,
  * Returns the new migratetype of the pageblock (or the same old migratetype
  * if it was unchanged).
  */
-static int try_to_steal_freepages(struct zone *zone, struct page *page,
-				  int start_type, int fallback_type)
+static void try_to_steal_freepages(struct zone *zone, struct page *page,
+				   int target_mt)
 {
+	int pages;
 	int current_order = page_order(page);

-	/*
-	 * When borrowing from MIGRATE_CMA, we need to release the excess
-	 * buddy pages to CMA itself. We also ensure the freepage_migratetype
-	 * is set to CMA so it is returned to the correct freelist in case
-	 * the page ends up being not actually allocated from the pcp lists.
-	 */
-	if (is_migrate_cma(fallback_type))
-		return fallback_type;
-
 	/* Take ownership for orders >= pageblock_order */
 	if (current_order >= pageblock_order) {
-		change_pageblock_range(page, current_order, start_type);
-		return start_type;
+		change_pageblock_range(page, current_order, target_mt);
+		return;
 	}

-	if (current_order >= pageblock_order / 2 ||
-	    start_type == MIGRATE_RECLAIMABLE ||
-	    page_group_by_mobility_disabled) {
-		int pages;
+	pages = move_freepages_block(zone, page, target_mt);

-		pages = move_freepages_block(zone, page, start_type);
+	/* Claim the whole block if over half of it is free */
+	if (pages >= (1 << (pageblock_order-1)) ||
+			page_group_by_mobility_disabled) {

-		/* Claim the whole block if over half of it is free */
-		if (pages >= (1 << (pageblock_order-1)) ||
-				page_group_by_mobility_disabled) {
+		set_pageblock_migratetype(page, target_mt);
+	}
+}

-			set_pageblock_migratetype(page, start_type);
-			return start_type;
-		}
+static bool can_steal_freepages(unsigned int order,
+				int start_mt, int fallback_mt)
+{
+	/*
+	 * When borrowing from MIGRATE_CMA, we need to release the excess
+	 * buddy pages to CMA itself. We also ensure the freepage_migratetype
+	 * is set to CMA so it is returned to the correct freelist in case
+	 * the page ends up being not actually allocated from the pcp lists.
+	 */
+	if (is_migrate_cma(fallback_mt))
+		return false;

-	}
+	/* Can take ownership for orders >= pageblock_order */
+	if (order >= pageblock_order)
+		return true;
+
+	if (order >= pageblock_order / 2 ||
+		start_mt == MIGRATE_RECLAIMABLE ||
+		page_group_by_mobility_disabled)
+		return true;

-	return fallback_type;
+	return false;
 }

 /* Remove an element from the buddy allocator from the fallback list */
@@ -1187,6 +1193,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 	unsigned int current_order;
 	struct page *page;
 	int migratetype, new_type, i;
+	bool can_steal;

 	/* Find the largest possible block of pages in the other list */
 	for (current_order = MAX_ORDER-1;
@@ -1194,6 +1201,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 				--current_order) {
 		for (i = 0;; i++) {
 			migratetype = fallbacks[start_migratetype][i];
+			new_type = migratetype;

 			/* MIGRATE_RESERVE handled later if necessary */
 			if (migratetype == MIGRATE_RESERVE)
@@ -1207,9 +1215,13 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 					struct page, lru);
 			area->nr_free--;

-			new_type = try_to_steal_freepages(zone, page,
-							  start_migratetype,
-							  migratetype);
+			can_steal = can_steal_freepages(current_order,
+					start_migratetype, migratetype);
+			if (can_steal) {
+				new_type = start_migratetype;
+				try_to_steal_freepages(zone, page,
+							start_migratetype);
+			}

 			/* Remove the page from the freelists */
 			list_del(&page->lru);
@@ -1225,7 +1237,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 			set_freepage_migratetype(page, new_type);

 			trace_mm_page_alloc_extfrag(page, order, current_order,
-				start_migratetype, migratetype, new_type);
+				start_migratetype, migratetype);

 			return page;
 		}
-- 
1.7.9.5
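To illustrate what the first patch changes in the fallback path, here is
a rough userspace model of the flow after the patch: once stealing is
allowed, the leftover buddy pages always go to the requested
migratetype's list, while the pageblock itself only changes type when
more than half of it was free. The names and numbers used here
(free_pages_in_block, the enum values) are made up for illustration and
are not the kernel's data structures:

#include <stdio.h>

enum migratetype { UNMOVABLE, RECLAIMABLE, MOVABLE };

static const int pageblock_order = 9;

struct steal_result {
        int freelist_type;       /* list the leftover buddy pages go to */
        int pageblock_type;      /* migratetype recorded for the pageblock */
};

/* Models try_to_steal_freepages() after the patch, assuming the caller
 * already decided (via can_steal_freepages()) that stealing is allowed. */
static struct steal_result steal(int start_mt, int fallback_mt,
                                 int free_pages_in_block)
{
        struct steal_result r = { start_mt, fallback_mt };

        /* Claim the whole block if over half of it is free. */
        if (free_pages_in_block >= 1 << (pageblock_order - 1))
                r.pageblock_type = start_mt;
        return r;
}

int main(void)
{
        struct steal_result r = steal(UNMOVABLE, MOVABLE, 100);

        printf("leftover pages -> list %d, pageblock keeps type %d\n",
               r.freelist_type, r.pageblock_type);
        /* Before the patch, the leftover pages would have gone back to the
         * MOVABLE list in this less-than-half-free case, even though
         * move_freepages_block() had already moved the block's free pages
         * to the UNMOVABLE list. */
        return 0;
}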
--------->8-----------------
>From d914adae4e54e049da5ccdc041cf4bd6d51d6ad0 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
Date: Mon, 1 Dec 2014 13:42:08 +0900
Subject: [PATCH 2/2] mm/compaction: use consistent criteria for anti
 fragmentation

The page allocator has an anti-fragmentation algorithm that prevents
unconditional stealing of freepages from another migratetype's buddy
list. Compaction also has an anti-fragmentation algorithm, but it is
different from the page allocator's: if no freepage is found in the
requested migratetype's buddy list, a free area of at least pageblock
order is required for compaction to finish. This mitigates
fragmentation, but it causes a problem. Consider a low-order,
non-movable, synchronous compaction. Most pageblocks are movable, so
even if compaction works well, the freepages it produces will not be in
a non-movable pageblock. In this case, compact_finished() returns
COMPACT_CONTINUE even though enough low-order freepages are available,
and compaction does not finish until the whole range of the zone has
been compacted. After that, a similar allocation request could steal a
freepage from another migratetype without any effort. We should take
fragmentation into account, but we need a consistent criterion that
imposes a fair overhead on each process. So, this patch changes
compaction's anti-fragmentation finish condition to the page
allocator's anti-fragmentation criteria.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
---
 include/linux/mmzone.h |    3 +++
 mm/compaction.c        |   31 +++++++++++++++++++++++++++++--
 mm/internal.h          |    1 +
 mm/page_alloc.c        |    5 ++---
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2f0856d..87f5bb5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -63,6 +63,9 @@ enum {
 	MIGRATE_TYPES
 };

+#define FALLBACK_MIGRATETYPES (4)
+extern int fallbacks[MIGRATE_TYPES][FALLBACK_MIGRATETYPES];
+
 #ifdef CONFIG_CMA
 # define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
 #else
diff --git a/mm/compaction.c b/mm/compaction.c
index 52e2e69..e9dcc6c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1049,6 +1049,30 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
 }

+static bool can_steal_fallbacks(struct free_area *area,
+			unsigned int order, int migratetype)
+{
+	int i;
+	int fallback_mt;
+
+	if (area->nr_free == 0)
+		return false;
+
+	for (i = 0; i < FALLBACK_MIGRATETYPES; i++) {
+		fallback_mt = fallbacks[migratetype][i];
+		if (fallback_mt == MIGRATE_RESERVE)
+			break;
+
+		if (list_empty(&area->free_list[fallback_mt]))
+			continue;
+
+		if (can_steal_freepages(order, migratetype, fallback_mt))
+			return true;
+	}
+
+	return false;
+}
+
 static int __compact_finished(struct zone *zone, struct compact_control *cc,
 			    const int migratetype)
 {
@@ -1099,8 +1123,11 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
 		if (!list_empty(&area->free_list[migratetype]))
 			return COMPACT_PARTIAL;

-		/* Job done if allocation would set block type */
-		if (order >= pageblock_order && area->nr_free)
+		/*
+		 * Job done if allocation would steal freepages from
+		 * other migratetype buddy lists.
+		 */
+		if (can_steal_fallbacks(area, order, migratetype))
 			return COMPACT_PARTIAL;
 	}

diff --git a/mm/internal.h b/mm/internal.h
index efad241..7028d83 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -179,6 +179,7 @@ unsigned long
 isolate_migratepages_range(struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn);

+bool can_steal_freepages(unsigned int order, int start_mt, int fallback_mt);
 #endif

 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77be830..b9c329c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1031,7 +1031,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
  * This array describes the order lists are fallen back to when
  * the free lists for the desirable migrate type are depleted
  */
-static int fallbacks[MIGRATE_TYPES][4] = {
+int fallbacks[MIGRATE_TYPES][FALLBACK_MIGRATETYPES] = {
 	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
 	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
 #ifdef CONFIG_CMA
@@ -1161,8 +1161,7 @@ static void try_to_steal_freepages(struct zone *zone, struct page *page,
 	}
 }

-static bool can_steal_freepages(unsigned int order,
-			int start_mt, int fallback_mt)
+bool can_steal_freepages(unsigned int order, int start_mt, int fallback_mt)
 {
 	/*
 	 * When borrowing from MIGRATE_CMA, we need to release the excess
-- 
1.7.9.5
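Finally, a rough standalone model of the finish criterion the second
patch gives compact_finished(): compaction for a given order can stop as
soon as the requested list has a free page, or a fallback list has one
that could be stolen without making fragmentation worse. The fallback
table contents, the free_counts[] array and pageblock_order = 9 are
simplified assumptions for the example, not the kernel structures:

#include <stdbool.h>
#include <stdio.h>

enum migratetype { UNMOVABLE, RECLAIMABLE, MOVABLE, RESERVE, NR_TYPES };

static const unsigned int pageblock_order = 9;

static const int fallbacks[NR_TYPES][3] = {
        [UNMOVABLE]   = { RECLAIMABLE, MOVABLE,   RESERVE },
        [RECLAIMABLE] = { UNMOVABLE,   MOVABLE,   RESERVE },
        [MOVABLE]     = { RECLAIMABLE, UNMOVABLE, RESERVE },
};

static bool can_steal(unsigned int order, int start_mt, int fallback_mt)
{
        if (order >= pageblock_order)
                return true;
        return order >= pageblock_order / 2 || start_mt == RECLAIMABLE;
}

/* free_counts[mt] stands in for zone->free_area[order].free_list[mt]. */
static bool compaction_done(unsigned int order, int mt,
                            const int free_counts[NR_TYPES])
{
        int i;

        if (free_counts[mt] > 0)
                return true;
        for (i = 0; i < 3; i++) {
                int fb = fallbacks[mt][i];

                if (fb == RESERVE)
                        break;
                if (free_counts[fb] > 0 && can_steal(order, mt, fb))
                        return true;
        }
        return false;
}

int main(void)
{
        /* Only the movable list has free pages at this order. */
        const int free_counts[NR_TYPES] = { [MOVABLE] = 8 };

        /* Low-order unmovable compaction keeps going; from order
         * pageblock_order/2 upward it can stop and steal instead. */
        printf("order 2 unmovable done? %d\n", compaction_done(2, UNMOVABLE, free_counts));
        printf("order 5 unmovable done? %d\n", compaction_done(5, UNMOVABLE, free_counts));
        return 0;
}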