On Mon, 2019-09-16 at 13:16 -0700, David Rientjes wrote:
> On Fri, 16 Aug 2019, Nitin Gupta wrote:
>
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) show that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the background
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresholds are set to 100 for all orders which essentially
> > disables kcompactd.
> >
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both extfrag > extfrag_high and
> > compaction_suitable(zone). This allows the kcompactd thread to stay
> > inactive even if extfrag thresholds are not met.
> >
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@xxxxxxxxxxxxxx/
> >
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >    respectively.
> >  - Use a test program to fragment memory: the program allocates all memory
> >    and then for each 2M aligned section, frees 3/4 of base pages using
> >    munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >    compaction till extfrag < extfrag_low for order-9.
> >
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> >
>
> Is there an update to this proposal or non-RFC patch that has been posted
> for proactive compaction?
>
> We've had good success with periodically compacting memory on a regular
> cadence on systems with hugepages enabled. The cadence itself is defined
> by the admin but it causes khugepaged[*] to periodically wake up and invoke
> compaction in an attempt to keep zones as defragmented as possible
> (perhaps more "proactive" than what is proposed here in an attempt to keep
> all memory as unfragmented as possible regardless of extfrag thresholds).
> It also avoids corner-cases where kcompactd could become more expensive
> than what is anticipated because it is unsuccessful at compacting memory
> yet the extfrag threshold is still exceeded.
>
>  [*] Khugepaged instead of kcompactd only because this is only enabled
>      for systems where transparent hugepages are enabled, probably better
>      off in kcompactd to avoid duplicating work between two kthreads if
>      there is already a need for background compaction.
>

Discussion on this RFC patch revolved around the issue of exposing too many
tunables (per-node, per-order, [low-high] extfrag thresholds). It was
sort-of concluded that no admin will get these tunables right for a variety
of workloads.
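
For reference, the low/high threshold scheme in the quoted RFC boils down to
a hysteresis check on each kcompactd wakeup. The following is only a rough
Python model of that decision, not code from the patch. It assumes extfrag
is a percentage in 0..100 and looks at a single page order. The function
name kcompactd_step and the (extfrag, suitable) tuples are made up for
illustration, with "suitable" standing in for compaction_suitable(zone).

```python
from typing import Iterable, Tuple


def kcompactd_step(active: bool,
                   zones: Iterable[Tuple[int, bool]],
                   low: int = 100,
                   high: int = 100) -> bool:
    """Return whether background compaction should be running after this wakeup.

    zones: (extfrag_percent, suitable) per zone on the node; hypothetical
           stand-ins for the per-zone extfrag index and compaction_suitable().
    low/high: the extfrag_low / extfrag_high thresholds for one order.
    """
    zones = list(zones)
    if not active:
        # Start only if some zone is above the high mark AND compaction
        # can actually help there (e.g. memory is not simply full).
        return any(extfrag > high and suitable for extfrag, suitable in zones)
    # Once started, keep going until every zone is below the low mark.
    return not all(extfrag < low for extfrag, _ in zones)


if __name__ == "__main__":
    # Thresholds from the test setup in the RFC: order-9, low=25, high=30.
    state = False
    for sample in ([(28, True)], [(35, True)], [(28, True)], [(20, True)]):
        state = kcompactd_step(state, sample, low=25, high=30)
        print(sample, "->", "compacting" if state else "idle")
```

With the default low = high = 100 the start condition can never hold, which
matches the "essentially disables kcompactd" behaviour described in the RFC.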

To eliminate the need for tunables, I proposed another patch:

https://patchwork.kernel.org/patch/11140067/

which does not add any tunables but extends and exports an existing
function (compact_zone_order). In summary, this new patch adds a callback
function which allows any driver to implement ad-hoc compaction policies.
There is also a sample driver which makes use of this interface to keep
hugepage external fragmentation within a specified range (exposed through
debugfs):

https://gitlab.com/nigupta/linux/snippets/1894161

-Nitin