On 11/15/19 11:21 PM, Nitin Gupta wrote:
> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher order allocations can
> fail if the memory is fragmented. The Linux kernel currently does
> on-demand compaction as we request more hugepages, but this style of
> compaction incurs very high latency. Experiments with one-time full
> memory compaction (followed by hugepage allocations) show that the
> kernel is able to restore a highly fragmented memory state to a fairly
> compacted memory state within <1 sec for a 32G system. Such data
> suggests that a more proactive compaction can help us allocate a large
> fraction of memory as hugepages while keeping allocation latencies low.
>
> For a more proactive compaction, the approach taken here is to define a
> per page-node tunable called 'hpage_compaction_effort' which dictates
> bounds for external fragmentation for HPAGE_PMD_ORDER pages which
> kcompactd should try to maintain.
>
> The tunable is exposed through sysfs:
> /sys/kernel/mm/compaction/node-n/hpage_compaction_effort
>
> The value of this tunable is used to determine low and high thresholds
> for external fragmentation wrt the HPAGE_PMD_ORDER order.

Could we instead start with a non-tunable value that would be linked,
e.g., to the number of THP allocations between kcompactd cycles?
Anything we expose will inevitably get set in stone, I'm afraid, so I
would introduce it only as a last resort.

> Note that the previous version of this patch [1] was found to introduce
> too many tunables (per-order, extfrag_{low, high}) but this one reduces
> them to just one (per-node, hpage_compaction_effort). Also, the new
> tunable is an opaque value instead of asking for specific bounds of
> "external fragmentation", which would have been difficult to estimate.
> The internal interpretation of this opaque value allows for future
> fine-tuning.
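[As a usage illustration of the sysfs interface quoted above (the node
number and effort value here are examples, not from the patch):]

```shell
# Set the proactive compaction effort for NUMA node 0 to 60
# (per the patch's translation, this means low=40%, high=50% extfrag).
echo 60 | sudo tee /sys/kernel/mm/compaction/node-0/hpage_compaction_effort

# Read back the current setting:
cat /sys/kernel/mm/compaction/node-0/hpage_compaction_effort
```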
>
> Currently, we use a simple translation from this tunable to [low, high]
> extfrag thresholds (low=100-hpage_compaction_effort, high=low+10%). To
> periodically check per-node extfrag status, we reuse per-node kcompactd
> threads, which are woken up every few milliseconds to check the same.
> If any zone on the corresponding node has extfrag above the high
> threshold for the HPAGE_PMD_ORDER order, the thread starts compaction
> in the background till all zones are below the low extfrag level for
> this order. By default, the tunable is set to 0 (=> low=100%,
> high=100%).
>
> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@xxxxxxxxxxxxxx/
>
> * Performance data
>
> System: x86_64, 32G RAM, 12 cores.
>
> I made a small driver that allocates as many hugepages as possible and
> measures allocation latency:
>
> The driver first tries to allocate a hugepage using GFP_TRANSHUGE_LIGHT
> and if that fails, tries to allocate with `GFP_TRANSHUGE |
> __GFP_RETRY_MAYFAIL`. The driver stops when both methods fail for a
> hugepage allocation.
>
> Before starting the driver, the system was fragmented by a userspace
> program that allocates all memory and then, for each 2M-aligned
> section, frees 3/4 of the base pages using munmap. The workload is
> mainly anonymous userspace pages, which are easy to move around. I
> intentionally avoided unmovable pages in this test to see how much
> latency we incur just by hitting the slow path for most allocations.
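[For reference, the tunable-to-threshold translation and the start/stop
behavior described in the quoted text can be sketched as follows. This
is an editorial illustration, not the kernel implementation; the
function names are invented, and the high threshold is capped at 100%
to match the stated effort=0 default of low=100%, high=100%.]

```c
#include <assert.h>

/* low = 100 - hpage_compaction_effort */
static int extfrag_low(int effort)
{
	return 100 - effort;
}

/* high = low + 10, capped at 100% (matches the effort=0 default) */
static int extfrag_high(int effort)
{
	int high = extfrag_low(effort) + 10;

	return high > 100 ? 100 : high;
}

/*
 * kcompactd-style decision with hysteresis: start compacting when a
 * zone's extfrag exceeds the high threshold, keep going until it drops
 * below the low threshold, otherwise keep the current state.
 */
static int should_compact(int extfrag, int effort, int compacting)
{
	if (extfrag > extfrag_high(effort))
		return 1;	/* above high: start/continue compaction */
	if (extfrag < extfrag_low(effort))
		return 0;	/* below low: stop */
	return compacting;	/* in between: no state change */
}
```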
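[The userspace fragmenter described in the quoted text can be sketched
roughly as below: map an anonymous region, then for each 2M-aligned
section unmap 3 of every 4 base (4K) pages. This is an editorial
sketch under the stated page sizes; the function name and which quarter
of pages is kept are invented details, not from the original program.]

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define HPAGE_SZ	(2UL << 20)	/* 2M, order-9 on 4K pages */
#define BASE_PAGE_SZ	4096UL

/*
 * For each 2M section of [base, base + len), keep every 4th base page
 * and munmap the other 3/4, leaving the region highly fragmented with
 * respect to order-9 allocations.
 */
static int fragment_region(unsigned char *base, size_t len)
{
	for (size_t off = 0; off + HPAGE_SZ <= len; off += HPAGE_SZ)
		for (size_t pg = 0; pg < HPAGE_SZ / BASE_PAGE_SZ; pg++)
			if (pg % 4 != 0 &&
			    munmap(base + off + pg * BASE_PAGE_SZ,
				   BASE_PAGE_SZ) != 0)
				return -1;
	return 0;
}
```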
>
> (all latency values are in microseconds)
>
> - With vanilla kernel 5.4.0-rc5:
>
> percentile latency
> ---------- -------
>          5       7
>         10       7
>         25       8
>         30       8
>         40       8
>         50       8
>         60       9
>         75     215
>         80     222
>         90     323
>         95     429
>
> Total 2M hugepages allocated = 1829 (3.5G worth of hugepages out of 25G
> total free => 14% of free memory could be allocated as hugepages)
>
> - Now with kernel 5.4.0-rc5 + this patch:
> (hpage_compaction_effort = 60)
>
> percentile latency
> ---------- -------
>          5       3
>         10       3
>         25       4
>         30       4
>         40       4
>         50       4
>         60       5
>         75       6
>         80       9
>         90     370
>         95     652
>
> Total 2M hugepages allocated = 11120 (21.7G worth of hugepages out of
> 25G total free => 86% of free memory could be allocated as hugepages)

I wonder about the 14->86% improvement. As you say, this kind of
fragmentation is easy to compact. Why wouldn't GFP_TRANSHUGE |
__GFP_RETRY_MAYFAIL attempts succeed?

Thanks,
Vlastimil

> The above workload produces a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, proactive compaction
> should essentially back off. To test this aspect, I ran a mix of these
> workloads (thanks to Matthew Wilcox for suggesting them):
>
> - dentry_thrash: it opens /tmp/missing.x for x in [1, 1000000] where
>   only the first 10000 files actually exist.
> - pagecache_thrash: it opens a 128G file (on a 32G system) and then
>   reads at random offsets.
>
> With this mix of workloads, the system quickly reaches 90-100%
> fragmentation wrt order-9. A trace of compaction events shows that we
> keep hitting the compaction_deferred event, as expected.
>
> After terminating dentry_thrash and dropping dentry caches, the system
> could proceed with compaction according to the set value of
> hpage_compaction_effort (60).
>
> [1] https://patchwork.kernel.org/patch/11098289/
>
> Signed-off-by: Nitin Gupta <nigupta@xxxxxxxxxx>