The patch titled
     Subject: mm, compaction: drain pcps for zone when kcompactd fails
has been added to the -mm tree.  Its filename is
     mm-compaction-drain-pcps-for-zone-when-kcompactd-fails.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-drain-pcps-for-zone-when-kcompactd-fails.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-compaction-drain-pcps-for-zone-when-kcompactd-fails.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: David Rientjes <rientjes@xxxxxxxxxx>
Subject: mm, compaction: drain pcps for zone when kcompactd fails

It's possible for free pages to become stranded on per-cpu pagesets
(pcps) that, if drained, could be merged with buddy pages on the zone's
free area to form large order pages, including up to MAX_ORDER.

Consider a verbose example using the tools/vm/page-types tool at the
beginning of a ZONE_NORMAL ('B' indicates a buddy page and 'S' indicates
a slab page).  Pages on pcps do not have any page flags set.

109954  1  _______S________________________________________________________
109955  2  __________B_____________________________________________________
109957  1  ________________________________________________________________
109958  1  __________B_____________________________________________________
109959  7  ________________________________________________________________
109960  1  __________B_____________________________________________________
109961  9  ________________________________________________________________
10996a  1  __________B_____________________________________________________
10996b  3  ________________________________________________________________
10996e  1  __________B_____________________________________________________
10996f  1  ________________________________________________________________
...
109f8c  1  __________B_____________________________________________________
109f8d  2  ________________________________________________________________
109f8f  2  __________B_____________________________________________________
109f91  f  ________________________________________________________________
109fa0  1  __________B_____________________________________________________
109fa1  7  ________________________________________________________________
109fa8  1  __________B_____________________________________________________
109fa9  1  ________________________________________________________________
109faa  1  __________B_____________________________________________________
109fab  1  _______S________________________________________________________

The compaction migration scanner is attempting to defragment this memory
since it is at the beginning of the zone.  It has done so quite well: all
movable pages have been migrated.  From pfn [0x109955, 0x109fab), there
are only buddy pages and pages without flags set.  These pages may be
stranded on pcps that could otherwise allow this memory to be coalesced
if freed back to the zone free area.
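To reproduce this kind of measurement, a minimal user-space sketch along
the lines of the analysis above might look like the following.  It is
illustrative only and not part of the patch; the pfn range is simply the
example span quoted above, and KPF_BUDDY is the buddy bit exported
through /proc/kpageflags.

/*
 * Illustrative only -- not part of this patch.  Walks /proc/kpageflags
 * (one u64 of KPF_* bits per pfn, privileged access required) and
 * reports the longest run of pages that are either buddy pages or have
 * no flags set, i.e. the kind of span discussed above.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define KPF_BUDDY	10	/* include/uapi/linux/kernel-page-flags.h */

int main(void)
{
	uint64_t start_pfn = 0x109955, end_pfn = 0x109fab;  /* example span */
	uint64_t flags, run = 0, longest = 0;
	int fd = open("/proc/kpageflags", O_RDONLY);

	if (fd < 0)
		return 1;

	for (uint64_t pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (pread(fd, &flags, sizeof(flags),
			  pfn * sizeof(flags)) != sizeof(flags))
			break;
		/* flag-less (possibly pcp) and buddy pages extend the run */
		if (flags == 0 || flags == (1ULL << KPF_BUDDY))
			longest = (++run > longest) ? run : longest;
		else
			run = 0;
	}
	printf("longest buddy/flag-free run: %llu pages\n",
	       (unsigned long long)longest);
	close(fd);
	return 0;
}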
It is possible that some of these pages may not be on pcps and that
something has called alloc_pages() and used the memory directly, but we
rely on the absence of __GFP_MOVABLE in these cases to allocate from
MIGRATE_UNMOVABLE pageblocks to try to keep these MIGRATE_MOVABLE
pageblocks as free as possible.

These buddy and pcp pages, spanning 1,621 pages, could be coalesced and
allow for three transparent hugepages to be dynamically allocated.
Running the numbers for all such spans on the system, it was found that
there were over 400 such spans of only buddy pages and pages without
flags set at the time this /proc/kpageflags sample was collected.
Without this support, there were _no_ order-9 or order-10 pages free.

When kcompactd fails to defragment memory such that a cc.order page can
be allocated, drain all pcps for the zone back to the buddy allocator so
this stranding cannot occur.  Compaction for that order will subsequently
be deferred, which acts as a ratelimit on this drain.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803010340100.88270@xxxxxxxxxxxxxxxxxxxxxxxxx
Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/compaction.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff -puN mm/compaction.c~mm-compaction-drain-pcps-for-zone-when-kcompactd-fails mm/compaction.c
--- a/mm/compaction.c~mm-compaction-drain-pcps-for-zone-when-kcompactd-fails
+++ a/mm/compaction.c
@@ -1988,6 +1988,14 @@ static void kcompactd_do_work(pg_data_t
 			compaction_defer_reset(zone, cc.order, false);
 		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
 			/*
+			 * Buddy pages may become stranded on pcps that could
+			 * otherwise coalesce on the zone's free area for
+			 * order >= cc.order.  This is ratelimited by the
+			 * upcoming deferral.
+			 */
+			drain_all_pages(zone);
+
+			/*
 			 * We use sync migration mode here, so we defer like
 			 * sync direct compaction does.
 			 */
_

Patches currently in -mm which might be from rientjes@xxxxxxxxxx are

mm-page_alloc-extend-kernelcore-and-movablecore-for-percent.patch
mm-page_alloc-extend-kernelcore-and-movablecore-for-percent-fix.patch
mm-page_alloc-move-mirrored_kernelcore-to-__meminitdata.patch
mm-compaction-drain-pcps-for-zone-when-kcompactd-fails.patch

--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
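As a closing note on the hunk quoted in the patch above: read in
context, it applies to the branch of kcompactd_do_work() sketched below.
The neighbouring lines, including the defer_compaction() call, are
reconstructed from the quoted diff context rather than copied from the
tree, so treat them as an approximation; the point is that the new
drain_all_pages() call sits immediately ahead of the existing deferral,
which is what provides the ratelimit mentioned in the new comment.

		if (status == COMPACT_SUCCESS) {
			compaction_defer_reset(zone, cc.order, false);
		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
			/*
			 * Buddy pages may become stranded on pcps that could
			 * otherwise coalesce on the zone's free area for
			 * order >= cc.order.  This is ratelimited by the
			 * upcoming deferral.
			 */
			drain_all_pages(zone);

			/*
			 * We use sync migration mode here, so we defer like
			 * sync direct compaction does.
			 */
			defer_compaction(zone, cc.order);
		}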