On 6/9/20 12:23 PM, Khalid Aziz wrote:
> On Mon, 2020-06-01 at 12:48 -0700, Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. The Linux kernel currently does
>> on-demand compaction as we request more hugepages, but this style of
>> compaction incurs very high latency. Experiments with one-time full
>> memory compaction (followed by hugepage allocations) show that the
>> kernel is able to restore a highly fragmented memory state to a fairly
>> compacted memory state within <1 sec for a 32G system. Such data
>> suggests that a more proactive compaction can help us allocate a large
>> fraction of memory as hugepages while keeping allocation latencies low.
>>
>> For a more proactive compaction, the approach taken here is to define
>> a new sysctl called 'vm.compaction_proactiveness' which dictates
>> bounds for the external fragmentation that kcompactd tries to maintain.
>>
>> The tunable takes a value in the range [0, 100], with a default of 20.
>>
>> Note that a previous version of this patch [1] was found to introduce
>> too many tunables (per-order extfrag{low, high}), but this one reduces
>> them to just one sysctl. Also, the new tunable is an opaque value
>> instead of asking for specific bounds of "external fragmentation",
>> which would have been difficult to estimate. The internal
>> interpretation of this opaque value allows for future fine-tuning.
>>
>> Currently, we use a simple translation from this tunable to [low, high]
>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
>> The score for a node is defined as the weighted mean of per-zone
>> external fragmentation. A zone's present_pages determines its weight.
>>
>> To periodically check per-node scores, we reuse the per-node kcompactd
>> threads, which are woken up every 500 milliseconds to check the same.
>> If a node's score exceeds its high threshold (as derived from the
>> user-provided proactiveness value), proactive compaction is started
>> until its score reaches its low threshold value. By default,
>> proactiveness is set to 20, which implies threshold values of low=80
>> and high=90.
>>
>> This patch is largely based on ideas from Michal Hocko [2]. See also
>> the LWN article [3].
>>
>> Performance data
>> ================
>>
>> System: x86_64, 1T RAM, 80 CPU threads.
>> Kernel: 5.6.0-rc3 + this patch
>>
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>
>> Before starting the driver, the system was fragmented from a userspace
>> program that allocates all memory and then, for each 2M aligned
>> section, frees 3/4 of the base pages using munmap. The workload is
>> mainly anonymous userspace pages, which are easy to move around. I
>> intentionally avoided unmovable pages in this test to see how much
>> latency we incur when hugepage allocations hit direct compaction.
>>
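[For reference, the fragmentation step described above could look roughly
like the minimal sketch below. The region size, the MAP_POPULATE prefault,
and the choice to unmap the first (rather than some other) 3/4 of each 2M
section are assumptions for illustration; this is not the actual program
used for the measurements.]

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE  (64UL << 30)	/* example: 64G; pick roughly all free RAM */
#define SECTION_SIZE (2UL << 20)	/* 2M sections */

int main(void)
{
	char *base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	unsigned long sec, end;

	if (base == MAP_FAILED)
		return 1;

	/* Walk 2M-aligned sections and unmap the first 3/4 of each one. */
	end = (unsigned long)base + REGION_SIZE;
	sec = ((unsigned long)base + SECTION_SIZE - 1) & ~(SECTION_SIZE - 1);
	for (; sec + SECTION_SIZE <= end; sec += SECTION_SIZE)
		munmap((void *)sec, (SECTION_SIZE / 4) * 3);

	pause();	/* keep the surviving 1/4 of each section resident */
	return 0;
}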
>> 1. Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>>   percentile latency
>>   –––––––––– –––––––
>>            5    7894
>>           10    9496
>>           25   12561
>>           30   15295
>>           40   18244
>>           50   21229
>>           60   27556
>>           75   30147
>>           80   31047
>>           90   32859
>>           95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>>   sysctl -w vm.compaction_proactiveness=20
>>
>>   percentile latency
>>   –––––––––– –––––––
>>            5       2
>>           10       2
>>           25       3
>>           30       3
>>           40       3
>>           50       4
>>           60       4
>>           75       4
>>           80       4
>>           90       5
>>           95     429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> 2. JAVA heap allocation
>>
>> In this test, we first fragment memory using the same method as for (1).
>>
>> Then, we start a Java process with a heap size set to 700G and request
>> the heap to be allocated with THP hugepages. We also set THP to madvise
>> to allow hugepage backing of this heap.
>>
>> /usr/bin/time
>>  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>>
>> The above command allocates 700G of Java heap using hugepages.
>>
>> - With vanilla 5.6.0-rc3
>>
>> 17.39user 1666.48system 27:37.89elapsed
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>> 8.35user 194.58system 3:19.62elapsed
>>
>> Elapsed time remains around 3:15 as proactiveness is further increased.
>>
>> Note that proactive compaction happens throughout the runtime of these
>> workloads. The situation of one-time compaction, sufficient to supply
>> hugepages for the following allocation stream, can probably happen for
>> more extreme proactiveness values, like 80 or 90.
>>
>> In the above Java workload, proactiveness is set to 20. The test starts
>> with a node's score of 80 or higher, depending on the delay between the
>> fragmentation step and starting the benchmark, which gives more or less
>> time for the initial round of compaction. As the benchmark consumes
>> hugepages, the node's score quickly rises above the high threshold (90)
>> and proactive compaction starts again, which brings the score back down
>> to the low threshold level (80). Repeat.
>>
>> bpftrace also confirms proactive compaction running 20+ times during
>> the runtime of this Java benchmark. A kcompactd thread consumes 100% of
>> one of the CPUs while it tries to bring a node's score within
>> thresholds.
>>
>> Backoff behavior
>> ================
>>
>> The above workloads produce a memory state which is easy to compact.
>> However, if memory is filled with unmovable pages, proactive compaction
>> should essentially back off. To test this aspect:
>>
>> - Created a kernel driver that allocates almost all memory as hugepages
>>   followed by freeing the first 3/4 of each hugepage.
>> - Set proactiveness=40
>> - Note that proactive_compact_node() is deferred the maximum number of
>>   times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>>   (=> ~30 seconds between retries).
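[To make the tunable-to-threshold translation and the per-node
fragmentation score from the changelog concrete, here is a small
standalone sketch. The struct, helper names, and sample numbers are
invented for illustration and do not mirror the actual mm/compaction.c
code, which additionally caps the thresholds for extreme proactiveness
values and defers compaction as described under "Backoff behavior".]

#include <stdio.h>

/* Toy stand-in for a zone; not the kernel's struct zone. */
struct toy_zone {
	long present_pages;
	int  extfrag;		/* external fragmentation, 0..100 */
};

/* vm.compaction_proactiveness -> [low, high] fragmentation score thresholds. */
static void score_thresholds(int proactiveness, int *low, int *high)
{
	*low  = 100 - proactiveness;	/* proactiveness=20 -> low=80  */
	*high = *low + 10;		/*                   -> high=90 */
}

/* Node score: per-zone extfrag weighted by each zone's present_pages. */
static int node_score(const struct toy_zone *zones, int nr_zones)
{
	long weighted = 0, total = 0;
	int i;

	for (i = 0; i < nr_zones; i++) {
		weighted += (long)zones[i].extfrag * zones[i].present_pages;
		total    += zones[i].present_pages;
	}
	return total ? (int)(weighted / total) : 0;
}

int main(void)
{
	/* Example numbers only; a real node would report these per zone. */
	struct toy_zone zones[] = {
		{ .present_pages = 1L << 18, .extfrag = 95 },
		{ .present_pages = 1L << 23, .extfrag = 88 },
	};
	int low, high, score = node_score(zones, 2);

	score_thresholds(20, &low, &high);
	printf("score=%d low=%d high=%d -> %s\n", score, low, high,
	       score > high ? "compact until score <= low" : "back off");
	return 0;
}

[With the default proactiveness of 20, this prints low=80 and high=90,
matching the defaults stated in the changelog above.]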
>>
>> [1] https://patchwork.kernel.org/patch/11098289/
>> [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@xxxxxxxxxxxxxx/
>> [3] https://lwn.net/Articles/817905/
>>
>> Signed-off-by: Nitin Gupta <nigupta@xxxxxxxxxx>
>> Reviewed-by: Vlastimil Babka <vbabka@xxxxxxx>
>> To: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
>> To: Michal Hocko <mhocko@xxxxxxxx>
>> To: Vlastimil Babka <vbabka@xxxxxxx>
>> CC: Matthew Wilcox <willy@xxxxxxxxxxxxx>
>> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> CC: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>> CC: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
>> CC: David Rientjes <rientjes@xxxxxxxxxx>
>> CC: Nitin Gupta <ngupta@xxxxxxxxxxxxxx>
>> CC: linux-kernel <linux-kernel@xxxxxxxxxxxxxxx>
>> CC: linux-mm <linux-mm@xxxxxxxxx>
>> CC: Linux API <linux-api@xxxxxxxxxxxxxxx>
>>
>> ---
>> Changelog v6 vs v5:
>>  - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined,
>>    and some cleanups (Vlastimil)
>>  - Cap min threshold to avoid excess compaction load in case user sets
>>    extreme values like 100 for `vm.compaction_proactiveness` sysctl
>>    (Khalid)
>>  - Add some more explanation about the effect of the tunable on
>>    compaction behavior in the user guide (Khalid)
>>
>> Changelog v5 vs v4:
>>  - Change tunable from sysfs to sysctl (Vlastimil)
>>  - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
>>  - Minor cleanups (remove redundant initializations, ...)
>>
>> Changelog v4 vs v3:
>>  - Document various functions.
>>  - Added admin-guide for the new tunable `proactiveness`.
>>  - Rename proactive_compaction_score to fragmentation_score for
>>    clarity.
>>
>> Changelog v3 vs v2:
>>  - Make proactiveness a global tunable and not per-node. Also updated
>>    the patch description to reflect the same (Vlastimil Babka).
>>  - Don't start proactive compaction if kswapd is running (Vlastimil
>>    Babka).
>>  - Clarified in the description that compaction runs in parallel with
>>    the workload, instead of a one-time compaction followed by a stream
>>    of hugepage allocations.
>>
>> Changelog v2 vs v1:
>>  - Introduce per-node and per-zone "proactive compaction score". This
>>    score is compared against watermarks which are set according to the
>>    user-provided proactiveness value.
>>  - Separate code-paths for proactive compaction from targeted
>>    compaction, i.e. where pgdat->kcompactd_max_order is non-zero.
>>  - Renamed hpage_compaction_effort -> proactiveness. In the future we
>>    may use more than extfrag wrt hugepage size to determine the
>>    proactive compaction score.
>> ---
>>  Documentation/admin-guide/sysctl/vm.rst |  15 ++
>>  include/linux/compaction.h              |   2 +
>>  kernel/sysctl.c                         |   9 ++
>>  mm/compaction.c                         | 183 +++++++++++++++++++++++-
>>  mm/internal.h                           |   1 +
>>  mm/vmstat.c                             |  18 +++
>>  6 files changed, 223 insertions(+), 5 deletions(-)
>
>
> Looks good to me.
>
> Reviewed-by: Khalid Aziz <khalid.aziz@xxxxxxxxxx>
>

Thanks Khalid.

Andrew, this patch now has two 'Reviewed-by' tags (Vlastimil and Khalid).
Can this be considered for the 5.8 merge now?

Thanks,
Nitin