On Mon, Feb 08, 2016 at 02:38:08PM +0100, Vlastimil Babka wrote:
> Memory compaction can currently be performed in several contexts:
> 
> - kswapd balancing a zone after a high-order allocation failure
> - direct compaction to satisfy a high-order allocation, including THP
>   page fault attempts
> - khugepaged trying to collapse a hugepage
> - manually from /proc
> 
> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
> (pending or future) high-order allocation, and is easy to evaluate. The
> other purpose is to keep overall memory fragmentation low and help the
> anti-fragmentation mechanism. The success wrt the latter purpose is more
> difficult to evaluate though.
> 
> The current situation wrt the purposes has a few drawbacks:
> 
> - compaction is invoked only when a high-order page or hugepage is not
>   available (or manually). This might be too late for the purposes of
>   keeping memory fragmentation low.
> - direct compaction increases latency of allocations. Again, it would be
>   better if compaction was performed asynchronously to keep fragmentation
>   low, before the allocation itself comes.
> - (a special case of the previous) the cost of compaction during THP page
>   faults can easily offset the benefits of THP.
> - kswapd compaction appears to be complex, fragile and not working in some
>   scenarios. It could also end up compacting for a high-order allocation
>   request when it should be reclaiming memory for a later order-0 request.
> 
> To improve the situation, we should be able to benefit from an equivalent
> of kswapd, but for compaction - i.e. a background thread which responds to
> fragmentation and the need for high-order allocations (including
> hugepages) somewhat proactively.
> 
> One possibility is to extend the responsibilities of kswapd, which could
> however complicate its design too much. It seems better to let kswapd
> handle reclaim, as order-0 allocations are often more critical than
> high-order ones.
> 
> Another possibility is to extend khugepaged, but this kthread is a single
> instance and tied to THP configs.
> 
> This patch goes with the option of a new set of per-node kthreads called
> kcompactd, and lays the foundations, without introducing any new tunables.
> The lifecycle mimics kswapd kthreads, including the memory hotplug hooks.
> 
> For compaction, kcompactd uses the standard compaction_suitable() and
> compact_finished() criteria and the deferred compaction functionality.
> Unlike direct compaction, it uses only sync compaction, as there's no
> allocation latency to minimize.
> 
> This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
> compact/reclaim loop for high-order pages will be replaced by waking up
> kcompactd in the next patch, together with a description of what's wrong
> with the old approach.
> 
> Waking up the kcompactd threads is also tied to kswapd activity and
> follows these rules:
> - we don't want to affect any fastpaths, so wake up kcompactd only from
>   the slowpath, as it's done for kswapd
> - if kswapd is doing reclaim, it's more important than compaction, so
>   don't invoke kcompactd until kswapd goes to sleep
> - the target order used for kswapd is passed to kcompactd
> 
> Possible future uses for kcompactd include the ability to wake it up on
> demand in special situations, such as when hugepages are not available
> (currently not done due to __GFP_NO_KSWAPD) or when a fragmentation event
> (i.e. __rmqueue_fallback()) occurs. It's also possible to perform periodic
> compaction with kcompactd.
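
Just to check I understand the wakeup rules above: I assume the next patch
will call this from kswapd's sleep path, roughly like the following
(hypothetical sketch based on the current kswapd_try_to_sleep(), not part
of this patch):

	/* in kswapd_try_to_sleep(), once reclaim has finished (sketch) */
	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
		/*
		 * kswapd is about to sleep, so compaction no longer
		 * competes with reclaim; hand kswapd's target order and
		 * classzone_idx over to kcompactd.
		 */
		wakeup_kcompactd(pgdat, order, classzone_idx);
		...
	}

If that's the plan, the rules make sense to me.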
> 
> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> ---
>  include/linux/compaction.h        |  16 +++
>  include/linux/mmzone.h            |   6 ++
>  include/linux/vm_event_item.h     |   1 +
>  include/trace/events/compaction.h |  55 ++++++++++
>  mm/compaction.c                   | 220 ++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c               |   9 +-
>  mm/page_alloc.c                   |   3 +
>  mm/vmstat.c                       |   1 +
>  8 files changed, 309 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4cd4ddf64cc7..1367c0564d42 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -52,6 +52,10 @@ extern void compaction_defer_reset(struct zone *zone, int order,
>  				bool alloc_success);
>  extern bool compaction_restarting(struct zone *zone, int order);
>  
> +extern int kcompactd_run(int nid);
> +extern void kcompactd_stop(int nid);
> +extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
> +
>  #else
>  static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
>  			unsigned int order, int alloc_flags,
> @@ -84,6 +88,18 @@ static inline bool compaction_deferred(struct zone *zone, int order)
>  	return true;
>  }
>  
> +static int kcompactd_run(int nid)
> +{
> +	return 0;
> +}
> +static void kcompactd_stop(int nid)
> +{
> +}
> +
> +static void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
> +{
> +}
> +
>  #endif /* CONFIG_COMPACTION */
>  
>  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 48cb6f0c6083..c8dfb14105c7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -670,6 +670,12 @@ typedef struct pglist_data {
>  					   mem_hotplug_begin/end() */
>  	int kswapd_max_order;
>  	enum zone_type classzone_idx;
> +#ifdef CONFIG_COMPACTION
> +	int kcompactd_max_order;
> +	enum zone_type kcompactd_classzone_idx;
> +	wait_queue_head_t kcompactd_wait;
> +	struct task_struct *kcompactd;
> +#endif
>  #ifdef CONFIG_NUMA_BALANCING
>  	/* Lock serializing the migrate rate limiting window */
>  	spinlock_t numabalancing_migrate_lock;
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 67c1dbd19c6d..58ecc056ee45 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -53,6 +53,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
>  		COMPACTISOLATED,
>  		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> +		KCOMPACTD_WAKE,
>  #endif
>  #ifdef CONFIG_HUGETLB_PAGE
>  	HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index 111e5666e5eb..e215bf68f521 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -350,6 +350,61 @@ DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_defer_reset,
>  );
>  #endif
>  
> +TRACE_EVENT(mm_compaction_kcompactd_sleep,
> +
> +	TP_PROTO(int nid),
> +
> +	TP_ARGS(nid),
> +
> +	TP_STRUCT__entry(
> +		__field(int, nid)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->nid = nid;
> +	),
> +
> +	TP_printk("nid=%d", __entry->nid)
> +);
> +
> +DECLARE_EVENT_CLASS(kcompactd_wake_template,
> +
> +	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
> +
> +	TP_ARGS(nid, order, classzone_idx),
> +
> +	TP_STRUCT__entry(
> +		__field(int, nid)
> +		__field(int, order)
> +		__field(enum zone_type, classzone_idx)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->nid = nid;
> +		__entry->order = order;
> +		__entry->classzone_idx = classzone_idx;
> +	),
> +
> +	TP_printk("nid=%d order=%d classzone_idx=%-8s",
> +		__entry->nid,
> +		__entry->order,
> +		__print_symbolic(__entry->classzone_idx, ZONE_TYPE))
> +);
> +
> +DEFINE_EVENT(kcompactd_wake_template, mm_compaction_wakeup_kcompactd,
> +
> +	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
> +
> +	TP_ARGS(nid, order, classzone_idx)
> +);
> +
> +DEFINE_EVENT(kcompactd_wake_template, mm_compaction_kcompactd_wake,
> +
> +	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
> +
> +	TP_ARGS(nid, order, classzone_idx)
> +);
> +
>  #endif /* _TRACE_COMPACTION_H */
>  
>  /* This part must be outside protection */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 93f71d968098..c03715ba65c7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -17,6 +17,9 @@
>  #include <linux/balloon_compaction.h>
>  #include <linux/page-isolation.h>
>  #include <linux/kasan.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
> +#include <linux/module.h>
>  #include "internal.h"
>  
>  #ifdef CONFIG_COMPACTION
> @@ -1736,4 +1739,221 @@ void compaction_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>  
> +static inline bool kcompactd_work_requested(pg_data_t *pgdat)
> +{
> +	return pgdat->kcompactd_max_order > 0;
> +}
> +
> +static bool kcompactd_node_suitable(pg_data_t *pgdat)
> +{
> +	int zoneid;
> +	struct zone *zone;
> +	enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;
> +
> +	for (zoneid = 0; zoneid < classzone_idx; zoneid++) {
> +		zone = &pgdat->node_zones[zoneid];
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
> +					classzone_idx) == COMPACT_CONTINUE)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void kcompactd_do_work(pg_data_t *pgdat)
> +{
> +	/*
> +	 * With no special task, compact all zones so that a page of requested
> +	 * order is allocatable.
> +	 */
> +	int zoneid;
> +	struct zone *zone;
> +	struct compact_control cc = {
> +		.order = pgdat->kcompactd_max_order,
> +		.classzone_idx = pgdat->kcompactd_classzone_idx,
> +		.mode = MIGRATE_SYNC_LIGHT,
> +		.ignore_skip_hint = true,
> +
> +	};
> +	bool success = false;
> +
> +	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
> +							cc.classzone_idx);
> +	count_vm_event(KCOMPACTD_WAKE);
> +
> +	for (zoneid = 0; zoneid < cc.classzone_idx; zoneid++) {
> +		int status;
> +
> +		zone = &pgdat->node_zones[zoneid];
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (compaction_deferred(zone, cc.order))
> +			continue;
> +
> +		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
> +						COMPACT_CONTINUE)
> +			continue;
> +
> +		cc.nr_freepages = 0;
> +		cc.nr_migratepages = 0;
> +		cc.zone = zone;
> +		INIT_LIST_HEAD(&cc.freepages);
> +		INIT_LIST_HEAD(&cc.migratepages);
> +
> +		status = compact_zone(zone, &cc);
> +
> +		if (zone_watermark_ok(zone, cc.order, low_wmark_pages(zone),
> +						cc.classzone_idx, 0)) {
> +			success = true;
> +			compaction_defer_reset(zone, cc.order, false);
> +		} else if (cc.mode != MIGRATE_ASYNC &&
> +			   status == COMPACT_COMPLETE) {
> +			defer_compaction(zone, cc.order);
> +		}

We already set cc.mode to MIGRATE_SYNC_LIGHT unconditionally, so this
cc.mode check is always true and looks weird. It would be better to drop
it and add a comment explaining why it's safe to call defer_compaction()
here; see the sketch at the end of this mail.

> +
> +		VM_BUG_ON(!list_empty(&cc.freepages));
> +		VM_BUG_ON(!list_empty(&cc.migratepages));
> +	}
> +
> +	/*
> +	 * Regardless of success, we are done until woken up next. But remember
> +	 * the requested order/classzone_idx in case it was higher/tighter than
> +	 * our current ones
> +	 */
> +	if (pgdat->kcompactd_max_order <= cc.order)
> +		pgdat->kcompactd_max_order = 0;
> +	if (pgdat->classzone_idx >= cc.classzone_idx)
> +		pgdat->classzone_idx = pgdat->nr_zones - 1;
> +}

Maybe you intended to update kcompactd_classzone_idx here rather than
kswapd's pgdat->classzone_idx; see the second sketch at the end of this
mail.

> +
> +void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
> +{
> +	if (!order)
> +		return;
> +
> +	if (pgdat->kcompactd_max_order < order)
> +		pgdat->kcompactd_max_order = order;
> +
> +	if (pgdat->kcompactd_classzone_idx > classzone_idx)
> +		pgdat->kcompactd_classzone_idx = classzone_idx;
> +
> +	if (!waitqueue_active(&pgdat->kcompactd_wait))
> +		return;
> +
> +	if (!kcompactd_node_suitable(pgdat))
> +		return;
> +
> +	trace_mm_compaction_wakeup_kcompactd(pgdat->node_id, order,
> +							classzone_idx);
> +	wake_up_interruptible(&pgdat->kcompactd_wait);
> +}
> +
> +/*
> + * The background compaction daemon, started as a kernel thread
> + * from the init process.
> + */
> +static int kcompactd(void *p)
> +{
> +	pg_data_t *pgdat = (pg_data_t*)p;
> +	struct task_struct *tsk = current;
> +
> +	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +
> +	if (!cpumask_empty(cpumask))
> +		set_cpus_allowed_ptr(tsk, cpumask);
> +
> +	set_freezable();
> +
> +	pgdat->kcompactd_max_order = 0;
> +	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
> +
> +	while (!kthread_should_stop()) {
> +		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
> +		wait_event_freezable(pgdat->kcompactd_wait,
> +				kcompactd_work_requested(pgdat));
> +
> +		kcompactd_do_work(pgdat);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * This kcompactd start function will be called by init and node-hot-add.
> + * On node-hot-add, kcompactd will be moved to proper cpus if cpus are
> + * hot-added.
> + */
> +int kcompactd_run(int nid)
> +{
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +	int ret = 0;
> +
> +	if (pgdat->kcompactd)
> +		return 0;
> +
> +	pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
> +	if (IS_ERR(pgdat->kcompactd)) {
> +		pr_err("Failed to start kcompactd on node %d\n", nid);
> +		ret = PTR_ERR(pgdat->kcompactd);
> +		pgdat->kcompactd = NULL;
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Called by memory hotplug when all memory in a node is offlined. Caller must
> + * hold mem_hotplug_begin/end().
> + */
> +void kcompactd_stop(int nid)
> +{
> +	struct task_struct *kcompactd = NODE_DATA(nid)->kcompactd;
> +
> +	if (kcompactd) {
> +		kthread_stop(kcompactd);
> +		NODE_DATA(nid)->kcompactd = NULL;
> +	}
> +}
> +
> +/*
> + * It's optimal to keep kcompactd on the same CPUs as their memory, but
> + * not required for correctness. So if the last cpu in a node goes
> + * away, we get changed to run anywhere: as the first one comes back,
> + * restore their cpu bindings.
> + */
> +static int cpu_callback(struct notifier_block *nfb, unsigned long action,
> +			void *hcpu)
> +{
> +	int nid;
> +
> +	if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
> +		for_each_node_state(nid, N_MEMORY) {
> +			pg_data_t *pgdat = NODE_DATA(nid);
> +			const struct cpumask *mask;
> +
> +			mask = cpumask_of_node(pgdat->node_id);
> +
> +			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
> +				/* One of our CPUs online: restore mask */
> +				set_cpus_allowed_ptr(pgdat->kcompactd, mask);
> +		}
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static int __init kcompactd_init(void)
> +{
> +	int nid;
> +
> +	for_each_node_state(nid, N_MEMORY)
> +		kcompactd_run(nid);
> +	hotcpu_notifier(cpu_callback, 0);
> +	return 0;
> +}
> +
> +module_init(kcompactd_init)
> +
>  #endif /* CONFIG_COMPACTION */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 46b46a9dcf81..7aa7697fedd1 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -33,6 +33,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/memblock.h>
>  #include <linux/bootmem.h>
> +#include <linux/compaction.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -1132,8 +1133,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  
>  	init_per_zone_wmark_min();
>  
> -	if (onlined_pages)
> +	if (onlined_pages) {
>  		kswapd_run(zone_to_nid(zone));
> +		kcompactd_run(nid);
> +	}
>  
>  	vm_total_pages = nr_free_pagecache_pages();
>  
> @@ -1907,8 +1910,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	zone_pcp_update(zone);
>  
>  	node_states_clear_node(node, &arg);
> -	if (arg.status_change_nid >= 0)
> +	if (arg.status_change_nid >= 0) {
>  		kswapd_stop(node);
> +		kcompactd_stop(node);
> +	}
>  
>  	vm_total_pages = nr_free_pagecache_pages();
>  	writeback_set_ratelimit();
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4ca4ead6ab05..d8ada4ab70c1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5484,6 +5484,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>  #endif
>  	init_waitqueue_head(&pgdat->kswapd_wait);
>  	init_waitqueue_head(&pgdat->pfmemalloc_wait);
> +#ifdef CONFIG_COMPACTION
> +	init_waitqueue_head(&pgdat->kcompactd_wait);
> +#endif
>  	pgdat_page_ext_init(pgdat);
>  
>  	for (j = 0; j < MAX_NR_ZONES; j++) {
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69ce64f7b8d7..c9571294f61c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -826,6 +826,7 @@ const char * const vmstat_text[] = {
>  	"compact_stall",
>  	"compact_fail",
>  	"compact_success",
> +	"compact_kcompatd_wake",
>  #endif
>  
>  #ifdef CONFIG_HUGETLB_PAGE
> -- 
> 2.7.0
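
To illustrate the defer_compaction() comment in kcompactd_do_work() above,
here is a minimal sketch of what I mean (untested, assuming cc.mode stays
unconditionally MIGRATE_SYNC_LIGHT):

		status = compact_zone(zone, &cc);

		if (zone_watermark_ok(zone, cc.order, low_wmark_pages(zone),
					cc.classzone_idx, 0)) {
			success = true;
			compaction_defer_reset(zone, cc.order, false);
		} else if (status == COMPACT_COMPLETE) {
			/*
			 * We use sync compaction, so COMPACT_COMPLETE means
			 * the whole zone was scanned without reaching the
			 * watermark. Deferring is then safe, same as after a
			 * failed sync direct compaction, and stops kcompactd
			 * from rescanning the zone on every wakeup.
			 */
			defer_compaction(zone, cc.order);
		}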
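
And for the bookkeeping at the end of kcompactd_do_work(), I'd expect the
kcompactd fields to be updated, i.e. something like (again just a sketch):

	if (pgdat->kcompactd_max_order <= cc.order)
		pgdat->kcompactd_max_order = 0;
	if (pgdat->kcompactd_classzone_idx >= cc.classzone_idx)
		pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;

As written, the patch resets kswapd's pgdat->classzone_idx instead, which
would interfere with kswapd's own bookkeeping.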