> -----Original Message----- > From: owner-linux-mm@xxxxxxxxx <owner-linux-mm@xxxxxxxxx> On Behalf > Of Michal Hocko > Sent: Tuesday, September 10, 2019 1:19 PM > To: Nitin Gupta <nigupta@xxxxxxxxxx> > Cc: akpm@xxxxxxxxxxxxxxxxxxxx; vbabka@xxxxxxx; > mgorman@xxxxxxxxxxxxxxxxxxx; dan.j.williams@xxxxxxxxx; > khalid.aziz@xxxxxxxxxx; Matthew Wilcox <willy@xxxxxxxxxxxxx>; Yu Zhao > <yuzhao@xxxxxxxxxx>; Qian Cai <cai@xxxxxx>; Andrey Ryabinin > <aryabinin@xxxxxxxxxxxxx>; Allison Randal <allison@xxxxxxxxxxx>; Mike > Rapoport <rppt@xxxxxxxxxxxxxxxxxx>; Thomas Gleixner > <tglx@xxxxxxxxxxxxx>; Arun KS <arunks@xxxxxxxxxxxxxx>; Wei Yang > <richard.weiyang@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux- > mm@xxxxxxxxx > Subject: Re: [PATCH] mm: Add callback for defining compaction completion > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote: > > For some applications we need to allocate almost all memory as > hugepages. > > However, on a running system, higher order allocations can fail if the > > memory is fragmented. Linux kernel currently does on-demand > compaction > > as we request more hugepages but this style of compaction incurs very > > high latency. Experiments with one-time full memory compaction > > (followed by hugepage allocations) shows that kernel is able to > > restore a highly fragmented memory state to a fairly compacted memory > > state within <1 sec for a 32G system. Such data suggests that a more > > proactive compaction can help us allocate a large fraction of memory > > as hugepages keeping allocation latencies low. > > > > In general, compaction can introduce unexpected latencies for > > applications that don't even have strong requirements for contiguous > > allocations. It is also hard to efficiently determine if the current > > system state can be easily compacted due to mixing of unmovable > > memory. Due to these reasons, automatic background compaction by the > > kernel itself is hard to get right in a way which does not hurt unsuspecting > applications or waste CPU cycles. > > We do trigger background compaction on a high order pressure from the > page allocator by waking up kcompactd. Why is that not sufficient? > Whenever kcompactd is woken up, it does just enough work to create one free page of the given order (compaction_control.order) or higher. Such a design causes very high latency for workloads where we want to allocate lots of hugepages in short period of time. With pro-active compaction we can hide much of this latency. For some more background discussion and data, please see this thread: https://patchwork.kernel.org/patch/11098289/ > > Even with these caveats, pro-active compaction can still be very > > useful in certain scenarios to reduce hugepage allocation latencies. > > This callback interface allows drivers to drive compaction based on > > their own policies like the current level of external fragmentation > > for a particular order, system load etc. > > So we do not trust the core MM to make a reasonable decision while we give > a free ticket to modules. How does this make any sense at all? How is a > random module going to make a more informed decision when it has less > visibility on the overal MM situation. > Embedding any specific policy (like: keep external fragmentation for order-9 between 30-40%) within MM core looks like a bad idea. As a driver, we can easily measure parameters like system load, current fragmentation level for any order in any zone etc. to make an informed decision. See the thread I refereed above for more background discussion. > If you need to control compaction from the userspace you have an interface > for that. It is also completely unexplained why you need a completion > callback. > /proc/sys/vm/compact_memory does whole system compaction which is often too much as a pro-active compaction strategy. To get more control over how to compaction work to do, I have added a compaction callback which controls how much work is done in one compaction cycle. For example, as a test for this patch, I have a small test driver which defines [low, high] external fragmentation thresholds for the HPAGE_ORDER. Whenever extfrag is within this range, I run compact_zone_order with a callback which returns COMPACT_CONTINUE till extfrag > low threshold and returns COMPACT_PARTIAL_SKIPPED when extfrag <= low. Here's the code for this sample driver: https://gitlab.com/nigupta/memstress/snippets/1893847 Maybe this code can be added to Documentation/... Thanks, Nitin > > > Signed-off-by: Nitin Gupta <nigupta@xxxxxxxxxx> > > --- > > include/linux/compaction.h | 10 ++++++++++ > > mm/compaction.c | 20 ++++++++++++++------ > > mm/internal.h | 2 ++ > > 3 files changed, 26 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/compaction.h b/include/linux/compaction.h > > index 9569e7c786d3..1ea828450fa2 100644 > > --- a/include/linux/compaction.h > > +++ b/include/linux/compaction.h > > @@ -58,6 +58,16 @@ enum compact_result { > > COMPACT_SUCCESS, > > }; > > > > +/* Callback function to determine if compaction is finished. */ > > +typedef enum compact_result (*compact_finished_cb)( > > + struct zone *zone, int order); > > + > > +enum compact_result compact_zone_order(struct zone *zone, int order, > > + gfp_t gfp_mask, enum compact_priority prio, > > + unsigned int alloc_flags, int classzone_idx, > > + struct page **capture, > > + compact_finished_cb compact_finished_cb); > > + > > struct alloc_context; /* in mm/internal.h */ > > > > /* > > diff --git a/mm/compaction.c b/mm/compaction.c index > > 952dc2fb24e5..73e2e9246bc4 100644 > > --- a/mm/compaction.c > > +++ b/mm/compaction.c > > @@ -1872,6 +1872,9 @@ static enum compact_result > __compact_finished(struct compact_control *cc) > > return COMPACT_PARTIAL_SKIPPED; > > } > > > > + if (cc->compact_finished_cb) > > + return cc->compact_finished_cb(cc->zone, cc->order); > > + > > if (is_via_compact_memory(cc->order)) > > return COMPACT_CONTINUE; > > > > @@ -2274,10 +2277,11 @@ compact_zone(struct compact_control *cc, > struct capture_control *capc) > > return ret; > > } > > > > -static enum compact_result compact_zone_order(struct zone *zone, int > > order, > > +enum compact_result compact_zone_order(struct zone *zone, int order, > > gfp_t gfp_mask, enum compact_priority prio, > > unsigned int alloc_flags, int classzone_idx, > > - struct page **capture) > > + struct page **capture, > > + compact_finished_cb compact_finished_cb) > > { > > enum compact_result ret; > > struct compact_control cc = { > > @@ -2293,10 +2297,11 @@ static enum compact_result > compact_zone_order(struct zone *zone, int order, > > MIGRATE_ASYNC : > MIGRATE_SYNC_LIGHT, > > .alloc_flags = alloc_flags, > > .classzone_idx = classzone_idx, > > - .direct_compaction = true, > > + .direct_compaction = !compact_finished_cb, > > .whole_zone = (prio == MIN_COMPACT_PRIORITY), > > .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY), > > - .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY) > > + .ignore_block_suitable = (prio == > MIN_COMPACT_PRIORITY), > > + .compact_finished_cb = compact_finished_cb > > }; > > struct capture_control capc = { > > .cc = &cc, > > @@ -2313,11 +2318,13 @@ static enum compact_result > compact_zone_order(struct zone *zone, int order, > > VM_BUG_ON(!list_empty(&cc.freepages)); > > VM_BUG_ON(!list_empty(&cc.migratepages)); > > > > - *capture = capc.page; > > + if (capture) > > + *capture = capc.page; > > current->capture_control = NULL; > > > > return ret; > > } > > +EXPORT_SYMBOL(compact_zone_order); > > > > int sysctl_extfrag_threshold = 500; > > > > @@ -2361,7 +2368,8 @@ enum compact_result > try_to_compact_pages(gfp_t gfp_mask, unsigned int order, > > } > > > > status = compact_zone_order(zone, order, gfp_mask, prio, > > - alloc_flags, ac_classzone_idx(ac), capture); > > + alloc_flags, ac_classzone_idx(ac), capture, > > + NULL); > > rc = max(status, rc); > > > > /* The allocation should succeed, stop compacting */ diff --git > > a/mm/internal.h b/mm/internal.h index e32390802fd3..f873f7c2b9dc > > 100644 > > --- a/mm/internal.h > > +++ b/mm/internal.h > > @@ -11,6 +11,7 @@ > > #include <linux/mm.h> > > #include <linux/pagemap.h> > > #include <linux/tracepoint-defs.h> > > +#include <linux/compaction.h> > > > > /* > > * The set of flags that only affect watermark checking and reclaim > > @@ -203,6 +204,7 @@ struct compact_control { > > bool whole_zone; /* Whole zone should/has been > scanned */ > > bool contended; /* Signal lock or sched > contention */ > > bool rescan; /* Rescanning the same pageblock */ > > + compact_finished_cb compact_finished_cb; > > }; > > > > /* > > -- > > 2.21.0 > > -- > Michal Hocko > SUSE Labs