On Tue, Jan 26, 2016 at 04:36:14PM +0100, Vlastimil Babka wrote:
> Memory compaction can be currently performed in several contexts:
> 
> - kswapd balancing a zone after a high-order allocation failure
> - direct compaction to satisfy a high-order allocation, including THP page
>   fault attempts
> - khugepaged trying to collapse a hugepage
> - manually from /proc
> 
> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
> (pending or future) high-order allocation, and is easy to evaluate. The other
> purpose is to keep overall memory fragmentation low and help the
> anti-fragmentation mechanism. The success wrt the latter purpose is more
> difficult to evaluate though.
> 
> The current situation wrt the purposes has a few drawbacks:
> 
> - compaction is invoked only when a high-order page or hugepage is not
>   available (or manually). This might be too late for the purposes of keeping
>   memory fragmentation low.
> - direct compaction increases latency of allocations. Again, it would be
>   better if compaction was performed asynchronously to keep fragmentation low,
>   before the allocation itself comes.
> - (a special case of the previous) the cost of compaction during THP page
>   faults can easily offset the benefits of THP.
> - kswapd compaction appears to be complex, fragile and not working in some
>   scenarios
> 

An addendum to that is that kswapd can be compacting for a high-order
allocation request when it should be reclaiming memory for an order-0
request. My recollection is that kswapd compacting was meant to help atomic
high-order allocations, but I wonder if the same problem even exists with
the revised watermark handling.

> To improve the situation, we should be able to benefit from an equivalent of
> kswapd, but for compaction - i.e. a background thread which responds to
> fragmentation and the need for high-order allocations (including hugepages)
> somewhat proactively.
> 
> One possibility is to extend the responsibilities of kswapd, which could
> however complicate its design too much. It should be better to let kswapd
> handle reclaim, as order-0 allocations are often more critical than high-order
> ones.
> 
> Another possibility is to extend khugepaged, but this kthread is a single
> instance and tied to THP configs.
> 

That also would not handle the atomic high-order allocation case.

> This patch goes with the option of a new set of per-node kthreads called
> kcompactd, and lays the foundations, without introducing any new tunables.
> The lifecycle mimics kswapd kthreads, including the memory hotplug hooks.
> 
> Waking up of the kcompactd threads is also tied to kswapd activity and follows
> these rules:
> - we don't want to affect any fastpaths, so wake up kcompactd only from the
>   slowpath, as it's done for kswapd

Ok

> - if kswapd is doing reclaim, it's more important than compaction, so don't
>   invoke kcompactd until kswapd goes to sleep

This makes sense given that kswapd can be reclaiming order-0 pages first, so
compaction can then start with a reasonable chance of success.

> - the target order used for kswapd is passed to kcompactd
> 
> The kswapd compact/reclaim loop for high-order pages will be removed in the
> next patch with the description of what's wrong with it.
> 
> In this patch, kcompactd uses the standard compaction_suitable() and
> compact_finished() criteria, which means it will most likely have nothing left
> to do after kswapd finishes, until the next patch. Kcompactd also mimics
> direct compaction somewhat by trying async compaction first and sync compaction
> afterwards, and uses the deferred compaction functionality.
> 

Why should it try async compaction first? The deferred compaction makes
sense, as kcompactd will need some sort of limitation on the amount of CPU
it can use.

> Possible future uses for kcompactd include the ability to wake up
> kcompactd on demand in special situations, such as when hugepages are not
> available (currently not done due to __GFP_NO_KSWAPD) or when a fragmentation
> event (i.e. __rmqueue_fallback()) occurs. It's also possible to perform
> periodic compaction with kcompactd.
> 
> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> ---
>  include/linux/compaction.h        |  16 +++
>  include/linux/mmzone.h            |   6 +
>  include/linux/vm_event_item.h     |   1 +
>  include/trace/events/compaction.h |  55 +++++++++
>  mm/compaction.c                   | 227 ++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c               |  15 ++-
>  mm/page_alloc.c                   |   3 +
>  mm/vmscan.c                       |   6 +
>  mm/vmstat.c                       |   1 +
>  9 files changed, 325 insertions(+), 5 deletions(-)
> 
> <SNIP>
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 585de54dbe8c..7452975fa481 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -17,6 +17,9 @@
>  #include <linux/balloon_compaction.h>
>  #include <linux/page-isolation.h>
>  #include <linux/kasan.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
> +#include <linux/module.h>
>  #include "internal.h"
>  
>  #ifdef CONFIG_COMPACTION
> @@ -29,6 +32,7 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
>  {
>  	count_vm_events(item, delta);
>  }
> +
>  #else
>  #define count_compact_event(item) do { } while (0)
>  #define count_compact_events(item, delta) do { } while (0)

Spurious whitespace change.

> @@ -1759,4 +1763,227 @@ void compaction_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>  
> +static bool kcompactd_work_requested(pg_data_t *pgdat)
> +{
> +	return pgdat->kcompactd_max_order > 0;
> +}
> +

inline?

> +static bool kcompactd_node_suitable(pg_data_t *pgdat)
> +{
> +	int zoneid;
> +	struct zone *zone;
> +
> +	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> +		zone = &pgdat->node_zones[zoneid];
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
> +					pgdat->kcompactd_classzone_idx)
> +					== COMPACT_CONTINUE)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +

Why does this traverse all zones and not just the ones within the
classzone_idx?

> +static void kcompactd_do_work(pg_data_t *pgdat)
> +{
> +	/*
> +	 * With no special task, compact all zones so that a page of requested
> +	 * order is allocatable.
> +	 */
> +	int zoneid;
> +	struct zone *zone;
> +	struct compact_control cc = {
> +		.order = pgdat->kcompactd_max_order,
> +		.classzone_idx = pgdat->kcompactd_classzone_idx,
> +		.mode = MIGRATE_ASYNC,
> +		.ignore_skip_hint = true,
> +
> +	};
> +	bool success = false;
> +
> +	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
> +						cc.classzone_idx);
> +	count_vm_event(KCOMPACTD_WAKE);
> +
> +retry:
> +	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {

Again, why is classzone_idx not taken into account?
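I would have expected this loop to stop at the classzone_idx passed in by
the waker, roughly along the lines of the untested sketch below (per-zone
body unchanged from the patch):

	/*
	 * Sketch only: limit the walk to zones usable by the allocation
	 * that woke kcompactd, instead of every zone in the node.
	 */
	for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) {
		zone = &pgdat->node_zones[zoneid];

		if (!populated_zone(zone))
			continue;

		/* ... same per-zone checks and compact_zone() call ... */
	}
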
> +		int status;
> +
> +		zone = &pgdat->node_zones[zoneid];
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (compaction_deferred(zone, cc.order))
> +			continue;
> +
> +		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
> +						COMPACT_CONTINUE)
> +			continue;
> +
> +		cc.nr_freepages = 0;
> +		cc.nr_migratepages = 0;
> +		cc.zone = zone;
> +		INIT_LIST_HEAD(&cc.freepages);
> +		INIT_LIST_HEAD(&cc.migratepages);
> +
> +		status = compact_zone(zone, &cc);
> +
> +		if (zone_watermark_ok(zone, cc.order, low_wmark_pages(zone),
> +						cc.classzone_idx, 0)) {
> +			success = true;
> +			compaction_defer_reset(zone, cc.order, false);
> +		} else if (cc.mode != MIGRATE_ASYNC &&
> +				status == COMPACT_COMPLETE) {
> +			defer_compaction(zone, cc.order);
> +		}
> +
> +		VM_BUG_ON(!list_empty(&cc.freepages));
> +		VM_BUG_ON(!list_empty(&cc.migratepages));
> +	}
> +
> +	if (!success && cc.mode == MIGRATE_ASYNC) {
> +		cc.mode = MIGRATE_SYNC_LIGHT;
> +		goto retry;
> +	}
> +

Still not getting why kcompactd should concern itself with async compaction.
It's really direct compaction that cared and was trying to avoid stalls.

> +	/*
> +	 * Regardless of success, we are done until woken up next. But remember
> +	 * the requested order/classzone_idx in case it was higher/tighter than
> +	 * our current ones
> +	 */
> +	if (pgdat->kcompactd_max_order <= cc.order)
> +		pgdat->kcompactd_max_order = 0;
> +	if (pgdat->classzone_idx >= cc.classzone_idx)
> +		pgdat->classzone_idx = pgdat->nr_zones - 1;
> +}
> +
> 
> <SNIP>
> 
> @@ -1042,7 +1043,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	arg.nr_pages = nr_pages;
>  	node_states_check_changes_online(nr_pages, zone, &arg);
>  
> -	nid = pfn_to_nid(pfn);
> +	nid = zone_to_nid(zone);
>  
>  	ret = memory_notify(MEM_GOING_ONLINE, &arg);
>  	ret = notifier_to_errno(ret);
> @@ -1082,7 +1083,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  
>  	if (onlined_pages) {
> -		node_states_set_node(zone_to_nid(zone), &arg);
> +		node_states_set_node(nid, &arg);
>  		if (need_zonelists_rebuild)
>  			build_all_zonelists(NULL, NULL);
>  		else

Why are these two hunks necessary?
> @@ -1093,8 +1094,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  
>  	init_per_zone_wmark_min();
>  
> -	if (onlined_pages)
> -		kswapd_run(zone_to_nid(zone));
> +	if (onlined_pages) {
> +		kswapd_run(nid);
> +		kcompactd_run(nid);
> +	}
>  
>  	vm_total_pages = nr_free_pagecache_pages();
>  
> @@ -1858,8 +1861,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	zone_pcp_update(zone);
>  
>  	node_states_clear_node(node, &arg);
> -	if (arg.status_change_nid >= 0)
> +	if (arg.status_change_nid >= 0) {
>  		kswapd_stop(node);
> +		kcompactd_stop(node);
> +	}
>  
>  	vm_total_pages = nr_free_pagecache_pages();
>  	writeback_set_ratelimit();
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 63358d9f9aa9..7747eb36e789 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5212,6 +5212,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>  #endif
>  	init_waitqueue_head(&pgdat->kswapd_wait);
>  	init_waitqueue_head(&pgdat->pfmemalloc_wait);
> +#ifdef CONFIG_COMPACTION
> +	init_waitqueue_head(&pgdat->kcompactd_wait);
> +#endif
>  	pgdat_page_ext_init(pgdat);
>  
>  	for (j = 0; j < MAX_NR_ZONES; j++) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 72d52d3aef74..1449e21c55cc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3408,6 +3408,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  		 */
>  		reset_isolation_suitable(pgdat);
>  
> +		/*
> +		 * We have freed the memory, now we should compact it to make
> +		 * allocation of the requested order possible.
> +		 */
> +		wakeup_kcompactd(pgdat, order, classzone_idx);
> +
>  		if (!kthread_should_stop())
>  			schedule();
> 

This initially confused me, but it's due to patch ordering. It's silly that
when this patch is applied, both kswapd and kcompactd are compacting memory.
I would prefer the patches in reverse order, but that is purely a matter of
taste.

While this was not a comprehensive review, I think the patch is ok in
principle. Deferred compaction will keep the CPU usage under control; the
main concern is that kcompactd consumes too much CPU, but I do not see a
case that would trigger that which kswapd would not have encountered
already.

-- 
Mel Gorman
SUSE Labs