On Mon, Feb 08, 2016 at 02:38:08PM +0100, Vlastimil Babka wrote:
> Memory compaction can currently be performed in several contexts:
> 
> - kswapd balancing a zone after a high-order allocation failure
> - direct compaction to satisfy a high-order allocation, including THP
>   page fault attempts
> - khugepaged trying to collapse a hugepage
> - manually from /proc
> 
> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
> (pending or future) high-order allocation, and is easy to evaluate. The
> other purpose is to keep overall memory fragmentation low and help the
> anti-fragmentation mechanism. The success wrt the latter purpose is more
> difficult to evaluate though.
> 
> The current situation wrt the purposes has a few drawbacks:
> 
> - compaction is invoked only when a high-order page or hugepage is not
>   available (or manually). This might be too late for the purposes of
>   keeping memory fragmentation low.
> - direct compaction increases latency of allocations. Again, it would be
>   better if compaction was performed asynchronously to keep fragmentation
>   low, before the allocation itself comes.
> - (a special case of the previous) the cost of compaction during THP page
>   faults can easily offset the benefits of THP.
> - kswapd compaction appears to be complex, fragile and not working in some
>   scenarios. It could also end up compacting for a high-order allocation
>   request when it should be reclaiming memory for a later order-0 request.
> 
> To improve the situation, we should be able to benefit from an equivalent
> of kswapd, but for compaction - i.e. a background thread which responds to
> fragmentation and the need for high-order allocations (including
> hugepages) somewhat proactively.
> 
> One possibility is to extend the responsibilities of kswapd, which could
> however complicate its design too much. It seems better to let kswapd
> handle reclaim, as order-0 allocations are often more critical than
> high-order ones.
> 
> Another possibility is to extend khugepaged, but this kthread is a single
> instance and tied to THP configs.
> 
> This patch goes with the option of a new set of per-node kthreads called
> kcompactd, and lays the foundations, without introducing any new tunables.
> The lifecycle mimics kswapd kthreads, including the memory hotplug hooks.
> 
> For compaction, kcompactd uses the standard compaction_suitable() and
> compact_finished() criteria and the deferred compaction functionality.
> Unlike direct compaction, it uses only sync compaction, as there's no
> allocation latency to minimize.
> 
> This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
> compact/reclaim loop for high-order pages will be replaced by waking up
> kcompactd in the next patch, together with a description of what's wrong
> with the old approach.
> 
> Waking up the kcompactd threads is also tied to kswapd activity and
> follows these rules:
> - we don't want to affect any fastpaths, so wake up kcompactd only from
>   the slowpath, as it's done for kswapd
> - if kswapd is doing reclaim, it's more important than compaction, so
>   don't invoke kcompactd until kswapd goes to sleep
> - the target order used for kswapd is passed to kcompactd
> 
> Possible future uses for kcompactd include the ability to wake it up on
> demand in special situations, such as when hugepages are not available
> (currently not done due to __GFP_NO_KSWAPD) or when a fragmentation event
> (i.e. __rmqueue_fallback()) occurs. It's also possible to perform periodic
> compaction with kcompactd.
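
Just to check I understand the wakeup rules above: I assume the next patch
will call this from kswapd's sleep path, roughly like the following
(hypothetical sketch based on the current kswapd_try_to_sleep(), not part
of this patch):

	/* in kswapd_try_to_sleep(), once reclaim has finished (sketch) */
	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
		/*
		 * kswapd is about to sleep, so compaction no longer
		 * competes with reclaim; hand kswapd's target order and
		 * classzone_idx over to kcompactd.
		 */
		wakeup_kcompactd(pgdat, order, classzone_idx);
		...
	}

If that's the plan, the rules make sense to me.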
> 
> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> ---
>  include/linux/compaction.h        |  16 +++
>  include/linux/mmzone.h            |   6 ++
>  include/linux/vm_event_item.h     |   1 +
>  include/trace/events/compaction.h |  55 ++++++++++
>  mm/compaction.c                   | 220 ++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c               |   9 +-
>  mm/page_alloc.c                   |   3 +
>  mm/vmstat.c                       |   1 +
>  8 files changed, 309 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4cd4ddf64cc7..1367c0564d42 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -52,6 +52,10 @@ extern void compaction_defer_reset(struct zone *zone, int order,
>  				bool alloc_success);
>  extern bool compaction_restarting(struct zone *zone, int order);
>  
> +extern int kcompactd_run(int nid);
> +extern void kcompactd_stop(int nid);
> +extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
> +
>  #else
>  static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
>  			unsigned int order, int alloc_flags,
> @@ -84,6 +88,18 @@ static inline bool compaction_deferred(struct zone *zone, int order)
>  	return true;
>  }
>  
> +static int kcompactd_run(int nid)
> +{
> +	return 0;
> +}
> +static void kcompactd_stop(int nid)
> +{
> +}
> +
> +static void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
> +{
> +}
> +
>  #endif /* CONFIG_COMPACTION */
>  
>  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 48cb6f0c6083..c8dfb14105c7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -670,6 +670,12 @@ typedef struct pglist_data {
>  					   mem_hotplug_begin/end() */
>  	int kswapd_max_order;
>  	enum zone_type classzone_idx;
> +#ifdef CONFIG_COMPACTION
> +	int kcompactd_max_order;
> +	enum zone_type kcompactd_classzone_idx;
> +	wait_queue_head_t kcompactd_wait;
> +	struct task_struct *kcompactd;
> +#endif
>  #ifdef CONFIG_NUMA_BALANCING
>  	/* Lock serializing the migrate rate limiting window */
>  	spinlock_t numabalancing_migrate_lock;
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 67c1dbd19c6d..58ecc056ee45 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -53,6 +53,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
>  		COMPACTISOLATED,
>  		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> +		KCOMPACTD_WAKE,
>  #endif
>  #ifdef CONFIG_HUGETLB_PAGE
>  	HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index 111e5666e5eb..e215bf68f521 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -350,6 +350,61 @@ DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_defer_reset,
>  );
>  #endif
>  
> +TRACE_EVENT(mm_compaction_kcompactd_sleep,
> +
> +	TP_PROTO(int nid),
> +
> +	TP_ARGS(nid),
> +
> +	TP_STRUCT__entry(
> +		__field(int, nid)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->nid = nid;
> +	),
> +
> +	TP_printk("nid=%d", __entry->nid)
> +);
> +
> +DECLARE_EVENT_CLASS(kcompactd_wake_template,
> +
> +	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
> +
> +	TP_ARGS(nid, order, classzone_idx),
> +
> +	TP_STRUCT__entry(
> +		__field(int, nid)
> +		__field(int, order)
> +		__field(enum zone_type, classzone_idx)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->nid = nid;
> +		__entry->order = order;
> +		__entry->classzone_idx = classzone_idx;
> +	),
> +
> +	TP_printk("nid=%d order=%d classzone_idx=%-8s",
> +		__entry->nid,
> +		__entry->order,
> +		__print_symbolic(__entry->classzone_idx, ZONE_TYPE))
> +);
> +
> +DEFINE_EVENT(kcompactd_wake_template, mm_compaction_wakeup_kcompactd,
> +
> +	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
> +
> +	TP_ARGS(nid, order, classzone_idx)
> +);
> +
> +DEFINE_EVENT(kcompactd_wake_template, mm_compaction_kcompactd_wake,
> +
> +	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
> +
> +	TP_ARGS(nid, order, classzone_idx)
> +);
> +
>  #endif /* _TRACE_COMPACTION_H */
>  
>  /* This part must be outside protection */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 93f71d968098..c03715ba65c7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -17,6 +17,9 @@
>  #include <linux/balloon_compaction.h>
>  #include <linux/page-isolation.h>
>  #include <linux/kasan.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
> +#include <linux/module.h>
>  #include "internal.h"
>  
>  #ifdef CONFIG_COMPACTION
> @@ -1736,4 +1739,221 @@ void compaction_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>  
> +static inline bool kcompactd_work_requested(pg_data_t *pgdat)
> +{
> +	return pgdat->kcompactd_max_order > 0;
> +}
> +
> +static bool kcompactd_node_suitable(pg_data_t *pgdat)
> +{
> +	int zoneid;
> +	struct zone *zone;
> +	enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;
> +
> +	for (zoneid = 0; zoneid < classzone_idx; zoneid++) {
> +		zone = &pgdat->node_zones[zoneid];
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
> +					classzone_idx) == COMPACT_CONTINUE)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void kcompactd_do_work(pg_data_t *pgdat)
> +{
> +	/*
> +	 * With no special task, compact all zones so that a page of requested
> +	 * order is allocatable.
> +	 */
> +	int zoneid;
> +	struct zone *zone;
> +	struct compact_control cc = {
> +		.order = pgdat->kcompactd_max_order,
> +		.classzone_idx = pgdat->kcompactd_classzone_idx,
> +		.mode = MIGRATE_SYNC_LIGHT,
> +		.ignore_skip_hint = true,
> +
> +	};
> +	bool success = false;
> +
> +	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
> +							cc.classzone_idx);
> +	count_vm_event(KCOMPACTD_WAKE);
> +
> +	for (zoneid = 0; zoneid < cc.classzone_idx; zoneid++) {
> +		int status;
> +
> +		zone = &pgdat->node_zones[zoneid];
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (compaction_deferred(zone, cc.order))
> +			continue;
> +
> +		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
> +						COMPACT_CONTINUE)
> +			continue;
> +
> +		cc.nr_freepages = 0;
> +		cc.nr_migratepages = 0;
> +		cc.zone = zone;
> +		INIT_LIST_HEAD(&cc.freepages);
> +		INIT_LIST_HEAD(&cc.migratepages);
> +
> +		status = compact_zone(zone, &cc);
> +
> +		if (zone_watermark_ok(zone, cc.order, low_wmark_pages(zone),
> +						cc.classzone_idx, 0)) {
> +			success = true;
> +			compaction_defer_reset(zone, cc.order, false);
> +		} else if (cc.mode != MIGRATE_ASYNC &&
> +			   status == COMPACT_COMPLETE) {
> +			defer_compaction(zone, cc.order);
> +		}

We already set cc.mode to MIGRATE_SYNC_LIGHT unconditionally, so this
cc.mode check is always true and looks weird. It would be better to drop
it and add a comment explaining why it's safe to call defer_compaction()
here; see the sketch at the end of this mail.

> +
> +		VM_BUG_ON(!list_empty(&cc.freepages));
> +		VM_BUG_ON(!list_empty(&cc.migratepages));
> +	}
> +
> +	/*
> +	 * Regardless of success, we are done until woken up next. But remember
> +	 * the requested order/classzone_idx in case it was higher/tighter than
> +	 * our current ones
> +	 */
> +	if (pgdat->kcompactd_max_order <= cc.order)
> +		pgdat->kcompactd_max_order = 0;
> +	if (pgdat->classzone_idx >= cc.classzone_idx)
> +		pgdat->classzone_idx = pgdat->nr_zones - 1;
> +}

Maybe you intended to update kcompactd_classzone_idx here rather than
kswapd's pgdat->classzone_idx; see the second sketch at the end of this
mail.

> +
> +void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
> +{
> +	if (!order)
> +		return;
> +
> +	if (pgdat->kcompactd_max_order < order)
> +		pgdat->kcompactd_max_order = order;
> +
> +	if (pgdat->kcompactd_classzone_idx > classzone_idx)
> +		pgdat->kcompactd_classzone_idx = classzone_idx;
> +
> +	if (!waitqueue_active(&pgdat->kcompactd_wait))
> +		return;
> +
> +	if (!kcompactd_node_suitable(pgdat))
> +		return;
> +
> +	trace_mm_compaction_wakeup_kcompactd(pgdat->node_id, order,
> +							classzone_idx);
> +	wake_up_interruptible(&pgdat->kcompactd_wait);
> +}
> +
> +/*
> + * The background compaction daemon, started as a kernel thread
> + * from the init process.
> + */
> +static int kcompactd(void *p)
> +{
> +	pg_data_t *pgdat = (pg_data_t*)p;
> +	struct task_struct *tsk = current;
> +
> +	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +
> +	if (!cpumask_empty(cpumask))
> +		set_cpus_allowed_ptr(tsk, cpumask);
> +
> +	set_freezable();
> +
> +	pgdat->kcompactd_max_order = 0;
> +	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
> +
> +	while (!kthread_should_stop()) {
> +		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
> +		wait_event_freezable(pgdat->kcompactd_wait,
> +				kcompactd_work_requested(pgdat));
> +
> +		kcompactd_do_work(pgdat);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * This kcompactd start function will be called by init and node-hot-add.
> + * On node-hot-add, kcompactd will be moved to proper cpus if cpus are
> + * hot-added.
> + */
> +int kcompactd_run(int nid)
> +{
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +	int ret = 0;
> +
> +	if (pgdat->kcompactd)
> +		return 0;
> +
> +	pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
> +	if (IS_ERR(pgdat->kcompactd)) {
> +		pr_err("Failed to start kcompactd on node %d\n", nid);
> +		ret = PTR_ERR(pgdat->kcompactd);
> +		pgdat->kcompactd = NULL;
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Called by memory hotplug when all memory in a node is offlined. Caller must
> + * hold mem_hotplug_begin/end().
> + */
> +void kcompactd_stop(int nid)
> +{
> +	struct task_struct *kcompactd = NODE_DATA(nid)->kcompactd;
> +
> +	if (kcompactd) {
> +		kthread_stop(kcompactd);
> +		NODE_DATA(nid)->kcompactd = NULL;
> +	}
> +}
> +
> +/*
> + * It's optimal to keep kcompactd on the same CPUs as their memory, but
> + * not required for correctness. So if the last cpu in a node goes
> + * away, we get changed to run anywhere: as the first one comes back,
> + * restore their cpu bindings.
> + */
> +static int cpu_callback(struct notifier_block *nfb, unsigned long action,
> +			void *hcpu)
> +{
> +	int nid;
> +
> +	if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
> +		for_each_node_state(nid, N_MEMORY) {
> +			pg_data_t *pgdat = NODE_DATA(nid);
> +			const struct cpumask *mask;
> +
> +			mask = cpumask_of_node(pgdat->node_id);
> +
> +			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
> +				/* One of our CPUs online: restore mask */
> +				set_cpus_allowed_ptr(pgdat->kcompactd, mask);
> +		}
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static int __init kcompactd_init(void)
> +{
> +	int nid;
> +
> +	for_each_node_state(nid, N_MEMORY)
> +		kcompactd_run(nid);
> +	hotcpu_notifier(cpu_callback, 0);
> +	return 0;
> +}
> +
> +module_init(kcompactd_init)
> +
>  #endif /* CONFIG_COMPACTION */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 46b46a9dcf81..7aa7697fedd1 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -33,6 +33,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/memblock.h>
>  #include <linux/bootmem.h>
> +#include <linux/compaction.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -1132,8 +1133,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  
>  	init_per_zone_wmark_min();
>  
> -	if (onlined_pages)
> +	if (onlined_pages) {
>  		kswapd_run(zone_to_nid(zone));
> +		kcompactd_run(nid);
> +	}
>  
>  	vm_total_pages = nr_free_pagecache_pages();
>  
> @@ -1907,8 +1910,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	zone_pcp_update(zone);
>  
>  	node_states_clear_node(node, &arg);
> -	if (arg.status_change_nid >= 0)
> +	if (arg.status_change_nid >= 0) {
>  		kswapd_stop(node);
> +		kcompactd_stop(node);
> +	}
>  
>  	vm_total_pages = nr_free_pagecache_pages();
>  	writeback_set_ratelimit();
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4ca4ead6ab05..d8ada4ab70c1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5484,6 +5484,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>  #endif
>  	init_waitqueue_head(&pgdat->kswapd_wait);
>  	init_waitqueue_head(&pgdat->pfmemalloc_wait);
> +#ifdef CONFIG_COMPACTION
> +	init_waitqueue_head(&pgdat->kcompactd_wait);
> +#endif
>  	pgdat_page_ext_init(pgdat);
>  
>  	for (j = 0; j < MAX_NR_ZONES; j++) {
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69ce64f7b8d7..c9571294f61c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -826,6 +826,7 @@ const char * const vmstat_text[] = {
>  	"compact_stall",
>  	"compact_fail",
>  	"compact_success",
> +	"compact_kcompatd_wake",
>  #endif
>  
>  #ifdef CONFIG_HUGETLB_PAGE
> -- 
> 2.7.0
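
To illustrate the defer_compaction() comment in kcompactd_do_work() above,
here is a minimal sketch of what I mean (untested, assuming cc.mode stays
unconditionally MIGRATE_SYNC_LIGHT):

		status = compact_zone(zone, &cc);

		if (zone_watermark_ok(zone, cc.order, low_wmark_pages(zone),
					cc.classzone_idx, 0)) {
			success = true;
			compaction_defer_reset(zone, cc.order, false);
		} else if (status == COMPACT_COMPLETE) {
			/*
			 * We use sync compaction, so COMPACT_COMPLETE means
			 * the whole zone was scanned without reaching the
			 * watermark. Deferring is then safe, same as after a
			 * failed sync direct compaction, and stops kcompactd
			 * from rescanning the zone on every wakeup.
			 */
			defer_compaction(zone, cc.order);
		}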
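
And for the bookkeeping at the end of kcompactd_do_work(), I'd expect the
kcompactd fields to be updated, i.e. something like (again just a sketch):

	if (pgdat->kcompactd_max_order <= cc.order)
		pgdat->kcompactd_max_order = 0;
	if (pgdat->kcompactd_classzone_idx >= cc.classzone_idx)
		pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;

As written, the patch resets kswapd's pgdat->classzone_idx instead, which
would interfere with kswapd's own bookkeeping.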