> -----Original Message-----
> From: owner-linux-mm@xxxxxxxxx <owner-linux-mm@xxxxxxxxx> On Behalf
> Of Vlastimil Babka
> Sent: Friday, November 29, 2019 5:55 AM
> To: Nitin Gupta <nigupta@xxxxxxxxxx>; Mel Gorman
> <mgorman@xxxxxxxxxxxxxxxxxxx>; Michal Hocko <mhocko@xxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>; Yu Zhao
> <yuzhao@xxxxxxxxxx>; Mike Kravetz <mike.kravetz@xxxxxxxxxx>; Matthew
> Wilcox <willy@xxxxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-mm@xxxxxxxxx
> Subject: Re: [PATCH] mm: Proactive compaction
>
> On 11/15/19 11:21 PM, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. The Linux kernel currently does
> > on-demand compaction as we request more hugepages, but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) show that the
> > kernel is able to restore a highly fragmented memory state to a
> > fairly compacted state in under 1 second on a 32G system. Such data
> > suggests that more proactive compaction can help us allocate a large
> > fraction of memory as hugepages while keeping allocation latencies
> > low.
> >
> > For more proactive compaction, the approach taken here is to define a
> > per-node tunable called 'hpage_compaction_effort' which dictates the
> > bounds on external fragmentation for HPAGE_PMD_ORDER pages that
> > kcompactd should try to maintain.
> >
> > The tunable is exposed through sysfs:
> > /sys/kernel/mm/compaction/node-n/hpage_compaction_effort
> >
> > The value of this tunable is used to determine low and high
> > thresholds for external fragmentation wrt HPAGE_PMD_ORDER.
>
> Could we instead start with a non-tunable value that would be linked to,
> e.g., the number of THP allocations between kcompactd cycles?
> Anything we expose will inevitably be set in stone, I'm afraid, so I
> would introduce it only as a last resort.
>

There have been attempts in the past to do proactive compaction without
adding any new tunables. For instance, see this patch from Khalid (CC'ed):

https://lkml.org/lkml/2019/8/12/1302

That patch collects allocation and fragmentation data points to predict
memory exhaustion or fragmentation, and triggers reclaim or compaction
proactively. The main concern in the discussion of that patch was that
such data collection and predictive analysis can be done from userspace,
together with watermark adjustment (already exposed), so no kernel
changes are potentially necessary. However, in the same discussion,
Michal pointed out the lack of a similar interface (like the watermarks
that guide kswapd) for kcompactd:

https://lkml.org/lkml/2019/8/13/730

My patch addresses the need for that missing userspace tunable: a
userspace daemon can measure fragmentation trends and adjust the tunable
to tell the kernel how much effort to put into compaction, so that the
direct compaction path is avoided (a rough sketch of such a daemon
follows the quoted text below).

> > Note that a previous version of this patch [1] was found to introduce
> > too many tunables (per-order extfrag_{low,high}), while this one
> > reduces them to a single per-node hpage_compaction_effort. Also, the
> > new tunable is an opaque value instead of asking for specific bounds
> > on "external fragmentation", which would have been difficult to
> > estimate. The internal interpretation of this opaque value allows for
> > future fine-tuning.
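As an aside, here is what such a daemon might look like. This is a
rough, self-contained sketch (userspace C): the order-9 extfrag estimate
from /proc/buddyinfo and the effort thresholds in main() are my own
illustrative policy, and the sysfs path exists only with this patch
applied.

#include <stdio.h>
#include <unistd.h>

/*
 * Share (0-100) of free memory on 'node' sitting in blocks smaller
 * than order-9, i.e. unusable for a hugepage without compaction.
 * Parsed from /proc/buddyinfo (11 per-order counts on x86_64).
 */
static int node_extfrag_order9(int node)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	char line[512];
	unsigned long total = 0, suitable = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		int n, o;
		char zone[16];
		unsigned long c[11];

		if (sscanf(line, "Node %d, zone %15s %lu %lu %lu %lu %lu"
			   " %lu %lu %lu %lu %lu %lu", &n, zone,
			   &c[0], &c[1], &c[2], &c[3], &c[4], &c[5],
			   &c[6], &c[7], &c[8], &c[9], &c[10]) != 13 ||
		    n != node)
			continue;
		for (o = 0; o <= 10; o++) {
			total += c[o] << o;
			if (o >= 9)
				suitable += c[o] << o;
		}
	}
	fclose(f);
	return total ? (int)(100 * (total - suitable) / total) : 0;
}

/* Write the per-node tunable added by this patch. */
static void set_effort(int node, int effort)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/compaction/node-%d/hpage_compaction_effort",
		 node);
	f = fopen(path, "w");
	if (f) {
		fprintf(f, "%d\n", effort);
		fclose(f);
	}
}

int main(void)
{
	for (;;) {
		int frag = node_extfrag_order9(0);

		/* Illustrative policy: more fragmentation -> more effort. */
		if (frag >= 0)
			set_effort(0, frag > 80 ? 90 : frag > 50 ? 60 : 0);
		sleep(5);
	}
}

A real daemon would of course smooth these samples and watch per-node
trends over time rather than react to a single reading.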
> > Currently, we use a simple translation from this tunable to the
> > [low, high] extfrag thresholds: low = 100 - hpage_compaction_effort,
> > and high = low + 10%. To periodically check per-node extfrag status,
> > we reuse the per-node kcompactd threads, which are woken up every few
> > milliseconds for this check. If any zone on the corresponding node
> > has extfrag above the high threshold for HPAGE_PMD_ORDER, the thread
> > starts compaction in the background until all zones are below the low
> > extfrag level for this order. By default, the tunable is set to 0
> > (=> low = 100%, high = 100%).
> >
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@xxxxxxxxxxxxxx/
> >
> > * Performance data
> >
> > System: x86_64, 32G RAM, 12 cores.
> >
> > I made a small driver that allocates as many hugepages as possible
> > and measures allocation latency.
> >
> > The driver first tries to allocate a hugepage using
> > GFP_TRANSHUGE_LIGHT and, if that fails, tries to allocate with
> > `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL`. The driver stops when both
> > methods fail for a hugepage allocation.
> >
> > Before starting the driver, the system was fragmented by a userspace
> > program that allocates all memory and then, for each 2M-aligned
> > section, frees 3/4 of the base pages using munmap. The workload is
> > mainly anonymous userspace pages, which are easy to move around. I
> > intentionally avoided unmovable pages in this test to see how much
> > latency we incur just by hitting the slow path for most allocations.
> >
> > (all latency values are in microseconds)
> >
> > - With vanilla kernel 5.4.0-rc5:
> >
> > percentile latency
> > ---------- -------
> >          5       7
> >         10       7
> >         25       8
> >         30       8
> >         40       8
> >         50       8
> >         60       9
> >         75     215
> >         80     222
> >         90     323
> >         95     429
> >
> > Total 2M hugepages allocated = 1829 (3.5G worth of hugepages out of
> > 25G total free => 14% of free memory could be allocated as hugepages)
> >
> > - Now with kernel 5.4.0-rc5 + this patch
> >   (hpage_compaction_effort = 60):
> >
> > percentile latency
> > ---------- -------
> >          5       3
> >         10       3
> >         25       4
> >         30       4
> >         40       4
> >         50       4
> >         60       5
> >         75       6
> >         80       9
> >         90     370
> >         95     652
> >
> > Total 2M hugepages allocated = 11120 (21.7G worth of hugepages out of
> > 25G total free => 86% of free memory could be allocated as hugepages)
>
> I wonder about the 14% -> 86% improvement. As you say, this kind of
> fragmentation is easy to compact. Why wouldn't GFP_TRANSHUGE |
> __GFP_RETRY_MAYFAIL attempts succeed?
>

I'm not too sure at this point. With kernel 5.3.0 I could allocate
80-90% of memory as hugepages under similar conditions, though with very
high latencies (as I reported here:
https://patchwork.kernel.org/patch/11098289/). With kernel 5.4.0-x I
observed this significant drop. Perhaps the GFP_TRANSHUGE flag changed
between these versions. I will post future numbers with base flags to
avoid such surprises.

Thanks,
Nitin

> > The above workload produces a memory state which is easy to compact.
> > However, if memory is filled with unmovable pages, proactive
> > compaction should essentially back off. To test this aspect, I ran a
> > mix of these workloads (thanks to Matthew Wilcox for suggesting
> > them):
> >
> > - dentry_thrash: opens /tmp/missing.x for x in [1, 1000000], where
> >   only the first 10000 files actually exist.
> > - pagecache_thrash: opens a 128G file (on a 32G system) and then
> >   reads from it at random offsets.
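For reference, a minimal sketch of the dentry_thrash workload just
described (the path pattern comes from the description above; the loop
structure and lack of error handling are mine):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Repeatedly look up /tmp/missing.x for x in [1, 1000000]. Only the
 * first 10000 files exist; every miss leaves a negative dentry
 * behind, filling slab pages that compaction cannot migrate.
 */
int main(void)
{
	char path[64];
	long x;

	for (;;) {
		for (x = 1; x <= 1000000; x++) {
			int fd;

			snprintf(path, sizeof(path), "/tmp/missing.%ld", x);
			fd = open(path, O_RDONLY);
			if (fd >= 0)
				close(fd); /* one of the 10000 that exist */
		}
	}
	return 0;
}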
> > With this mix of workloads, the system quickly reaches 90-100%
> > fragmentation wrt order-9. A trace of compaction events shows that we
> > keep hitting the compaction_deferred event, as expected.
> >
> > After terminating dentry_thrash and dropping dentry caches, the
> > system could proceed with compaction according to the set value of
> > hpage_compaction_effort (60).
> >
> > [1] https://patchwork.kernel.org/patch/11098289/
> >
> > Signed-off-by: Nitin Gupta <nigupta@xxxxxxxxxx>
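One more note on the thresholds: the behavior described above amounts to
simple per-node hysteresis. Below is a self-contained sketch of that
logic (plain C with placeholder names, purely for illustration; the
actual patch code may differ):

#include <stdbool.h>

static unsigned int effort_to_low(unsigned int effort)
{
	return 100 - effort;		/* effort=60 -> low=40% */
}

static unsigned int effort_to_high(unsigned int effort)
{
	unsigned int high = effort_to_low(effort) + 10;

	return high > 100 ? 100 : high;	/* high = low + 10%, capped */
}

/*
 * Called on each periodic kcompactd wakeup with the current extfrag
 * estimate (0-100) for HPAGE_PMD_ORDER on this node's worst zone:
 * start compacting above 'high', keep going until back below 'low'.
 */
static bool should_compact(bool compacting, unsigned int extfrag,
			   unsigned int effort)
{
	if (extfrag > effort_to_high(effort))
		return true;		/* crossed high: (re)start */
	if (extfrag <= effort_to_low(effort))
		return false;		/* reached low: stop */
	return compacting;		/* in between: keep prior state */
}

With the default effort of 0, both thresholds sit at 100%, so the
trigger can never fire, which matches the opt-in default above.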