On Fri, 16 Aug 2019, Nitin Gupta wrote: > For some applications we need to allocate almost all memory as > hugepages. However, on a running system, higher order allocations can > fail if the memory is fragmented. Linux kernel currently does > on-demand compaction as we request more hugepages but this style of > compaction incurs very high latency. Experiments with one-time full > memory compaction (followed by hugepage allocations) shows that kernel > is able to restore a highly fragmented memory state to a fairly > compacted memory state within <1 sec for a 32G system. Such data > suggests that a more proactive compaction can help us allocate a large > fraction of memory as hugepages keeping allocation latencies low. > > For a more proactive compaction, the approach taken here is to define > per page-order external fragmentation thresholds and let kcompactd > threads act on these thresholds. > > The low and high thresholds are defined per page-order and exposed > through sysfs: > > /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high} > > Per-node kcompactd thread is woken up every few seconds to check if > any zone on its node has extfrag above the extfrag_high threshold for > any order, in which case the thread starts compaction in the backgrond > till all zones are below extfrag_low level for all orders. By default > both these thresolds are set to 100 for all orders which essentially > disables kcompactd. > > To avoid wasting CPU cycles when compaction cannot help, such as when > memory is full, we check both, extfrag > extfrag_high and > compaction_suitable(zone). This allows kcomapctd thread to stays inactive > even if extfrag thresholds are not met. > > This patch is largely based on ideas from Michal Hocko posted here: > https://lore.kernel.org/linux-mm/20161230131412.GI13301@xxxxxxxxxxxxxx/ > > Testing done (on x86): > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30} > respectively. > - Use a test program to fragment memory: the program allocates all memory > and then for each 2M aligned section, frees 3/4 of base pages using > munmap. > - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts > compaction till extfrag < extfrag_low for order-9. > > The patch has plenty of rough edges but posting it early to see if I'm > going in the right direction and to get some early feedback. > Is there an update to this proposal or non-RFC patch that has been posted for proactive compaction? We've had good success with periodically compacting memory on a regular cadence on systems with hugepages enabled. The cadence itself is defined by the admin but it causes khugepaged[*] to periodically wakeup and invoke compaction in an attempt to keep zones as defragmented as possible (perhaps more "proactive" than what is proposed here in an attempt to keep all memory as unfragmented as possible regardless of extfrag thresholds). It also avoids corner-cases where kcompactd could become more expensive than what is anticipated because it is unsuccessful at compacting memory yet the extfrag threshold is still exceeded. [*] Khugepaged instead of kcompactd only because this is only enabled for systems where transparent hugepages are enabled, probably better off in kcompactd to avoid duplicating work between two kthreads if there is already a need for background compaction.