> -----Original Message-----
> From: owner-linux-mm@xxxxxxxxx <owner-linux-mm@xxxxxxxxx> On Behalf
> Of Mel Gorman
> Sent: Thursday, August 22, 2019 1:52 AM
> To: Nitin Gupta <nigupta@xxxxxxxxxx>
> Cc: akpm@xxxxxxxxxxxxxxxxxxxx; vbabka@xxxxxxx; mhocko@xxxxxxxx;
> dan.j.williams@xxxxxxxxx; Yu Zhao <yuzhao@xxxxxxxxxx>;
> Matthew Wilcox <willy@xxxxxxxxxxxxx>; Qian Cai <cai@xxxxxx>;
> Andrey Ryabinin <aryabinin@xxxxxxxxxxxxx>; Roman Gushchin <guro@xxxxxx>;
> Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>;
> Kees Cook <keescook@xxxxxxxxxxxx>; Jann Horn <jannh@xxxxxxxxxx>;
> Johannes Weiner <hannes@xxxxxxxxxxx>; Arun KS <arunks@xxxxxxxxxxxxxx>;
> Janne Huttunen <janne.huttunen@xxxxxxxxx>;
> Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>;
> linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx
> Subject: Re: [RFC] mm: Proactive compaction
>
> On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
>
> Note that proactive compaction may reduce allocation latency but it is not
> free either. Even though the scanning and migration may happen in a kernel
> thread, tasks can incur faults while waiting for compaction to complete if the
> task accesses data being migrated. This means that costs are incurred by
> applications on a system that may never care about high-order allocation
> latency -- particularly if the allocations typically happen at application
> initialisation time. I recognise that kcompactd makes a bit of effort to
> compact memory out-of-band but it also is typically triggered in response to
> reclaim that was triggered by a high-order allocation request. i.e. the work
> done by the thread is triggered by an allocation request that hit the slow
> paths and not a preemptive measure.
>

Hitting the slow path for every higher-order allocation is a significant
performance/latency issue for applications that require a large number of
these allocations to succeed in bursts. To get some concrete numbers, I made
a small driver that allocates as many hugepages as possible and measures the
latency of each allocation: the driver first tries to allocate a hugepage
using GFP_TRANSHUGE_LIGHT (referred to as "Light" in the table below) and,
if that fails, retries with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred
to as "Fallback" in the table below). We stop the allocation loop as soon as
both methods fail.

Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All latencies
are in microseconds.

| GFP/Stat |       Any |   Light | Fallback |
| -------: | --------: | ------: | -------: |
|    count |      9908 |     788 |     9120 |
|      min |       0.0 |     0.0 |   1726.0 |
|      max |  135387.0 |   142.0 | 135387.0 |
|     mean |   5494.66 |    1.83 |  5969.26 |
|   stddev |  21624.04 |    7.58 | 22476.06 |

As you can see, both the mean and the stddev of allocation latency are
extremely high with the current approach of on-demand compaction.
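For reference, the measurement loop in that driver is essentially the
following (a simplified sketch, not the actual driver code; the names are
illustrative and the real driver also aggregates the per-bucket counts shown
above):

#include <linux/gfp.h>
#include <linux/huge_mm.h>
#include <linux/ktime.h>
#include <linux/mm.h>

/*
 * Try a "Light" THP allocation first (no direct reclaim/compaction) and,
 * on failure, fall back to the heavier path. The elapsed time of whichever
 * attempt ran is what the tables aggregate.
 */
static struct page *alloc_huge_timed(s64 *light_us, s64 *fallback_us)
{
	ktime_t start = ktime_get();
	struct page *page;

	/* "Light": fails fast if no order-9 page is readily available. */
	page = alloc_pages(GFP_TRANSHUGE_LIGHT, HPAGE_PMD_ORDER);
	if (page) {
		*light_us = ktime_us_delta(ktime_get(), start);
		return page;
	}

	/* "Fallback": may stall in reclaim and on-demand compaction. */
	start = ktime_get();
	page = alloc_pages(GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL,
			   HPAGE_PMD_ORDER);
	if (page)
		*fallback_us = ktime_us_delta(ktime_get(), start);

	return page;	/* NULL: both attempts failed, stop the loop */
}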
The system was fragmented by a userspace program, as described in the patch
description (a rough sketch of that program is appended at the end of this
mail). The workload is mainly anonymous userspace pages, which are easy to
move around. I intentionally avoided unmovable pages in this test to see how
much latency we incur just by hitting the slow path for a majority of
allocations.

> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
>
> These will be difficult for an admin to tune that is not extremely familiar with
> how external fragmentation is defined. If an admin asked "how much will
> stalls be reduced by setting this to a different value?", the answer will always
> be "I don't know, maybe some, maybe not".
>

Yes, this is my main worry. These values can be set to empirically
determined values on highly specialized systems like database appliances;
however, on a generic system there is no single reasonable value. Still, at
the very least, I would like an interface that allows compacting the system
to a reasonable state. Something like:

  compact_extfrag(node, zone, order, high, low)

which starts compaction if extfrag > high and keeps going until
extfrag < low. It's possible that there are too many unmovable pages mixed
around for compaction to succeed; still, it is a reasonable interface to
expose rather than the current forced, on-demand style of compaction (please
see the data below). How (and if) to expose it to userspace (sysfs etc.) can
be a separate discussion.
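For concreteness, the number that 'high'/'low' (and the per-order sysfs
thresholds quoted above) would be compared against is the percentage of free
memory that cannot currently satisfy an order-N allocation. A sketch only,
assuming access to the contig_page_info helpers that mm/vmstat.c already
uses for the unusable-free-space index; the exact plumbing in the patch may
differ:

static unsigned int extfrag_for_order(struct zone *zone, unsigned int order)
{
	struct contig_page_info info;

	fill_contig_page_info(zone, order, &info);
	if (!info.free_pages)
		return 0;

	/* free pages not belonging to a free block of >= 2^order pages */
	return div_u64((info.free_pages -
			(info.free_blocks_suitable << order)) * 100,
		       info.free_pages);
}

/*
 * compact_extfrag() would start work for a zone/order once this returns
 * true and keep compacting until extfrag_for_order() drops below 'low';
 * compaction_suitable() keeps it idle when compaction cannot make progress
 * anyway (e.g. not enough free memory to migrate into).
 */
static bool zone_should_compact(struct zone *zone, unsigned int order,
				unsigned int high)
{
	return extfrag_for_order(zone, order) > high &&
	       compaction_suitable(zone, order, 0, zone_idx(zone)) ==
							COMPACT_CONTINUE;
}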
> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the background
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresholds are set to 100 for all orders which essentially
> > disables kcompactd.
> >
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows the kcompactd thread to stay
> > inactive even if extfrag thresholds are not met.
> >
>
> There is still a risk that if a system is completely fragmented that it may
> consume CPU on pointless compaction cycles. This is why compaction from
> kernel thread context makes no special effort and bails relatively quickly and
> assumes that if an application really needs high-order pages that it'll incur
> the cost at allocation time.
>

As the data in Table-1 shows, on-demand compaction can add high latency to
every single allocation. I think it would be a significant improvement (see
Table-2) to at least expose an interface that allows proactive compaction
(like compact_extfrag), which a driver could itself run in the background.
This way, we need not add any tunables to the kernel itself and can leave
the compaction policy to specialized kernel/userspace monitors.

> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@xxxxxxxxxxxxxx/
> >
> > Testing done (on x86):
> > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >   respectively.
> > - Use a test program to fragment memory: the program allocates all
> >   memory and then for each 2M aligned section, frees 3/4 of base pages
> >   using munmap.
> > - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> >   starts compaction till extfrag < extfrag_low for order-9.
> >
>
> This is a somewhat optimistic allocation scenario. The interesting ones are
> when a system is fragmented in a manner that is not trivial to resolve -- e.g.
> after a prolonged period of time with unmovable/reclaimable allocations
> stealing pageblocks. It's also fairly difficult to analyse if this is helping
> because you cannot measure after the fact how much time was saved in
> allocation time due to the work done by kcompactd. It is also hard to
> determine if the sum of the stalls incurred by proactive compaction is lower
> than the time saved at allocation time.
>
> I fear that the user-visible effect will be times when there are very short but
> numerous stalls due to proactive compaction running in the background that
> will be hard to detect while the benefits may be invisible.
>

Proactive compaction can be done in a non-time-critical context, so to
estimate its benefits we can compare the data in Table-1 against the same
run, under a similar fragmentation state, but with this patch applied:

Table-2: hugepage allocation latencies with this patch applied, on
5.3.0-rc5. All latencies are in microseconds.

| GFP/Stat |       Any |   Light | Fallback |
| -------: | --------: | ------: | -------: |
|    count |   12197.0 | 11167.0 |   1030.0 |
|      min |       2.0 |     2.0 |      5.0 |
|      max |  361727.0 |    26.0 | 361727.0 |
|     mean |    366.05 |    4.48 |  4286.13 |
|   stddev |   4575.53 |    1.41 | 15209.63 |

We can see that the mean latency dropped to 366us, compared with 5494us
before. This is an optimistic scenario where only a small mix of unmovable
pages was present, but the data still shows that when compaction can
succeed, proactive compaction can significantly reduce higher-order
allocation latencies.

> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> >
>
> As unappealing as it sounds, I think it is better to try improve the allocation
> latency itself instead of trying to hide the cost in a kernel thread. It's far
> harder to implement as compaction is not easy but it would be more
> obvious what the savings are by looking at a histogram of allocation latencies
> -- there are other metrics that could be considered but that's the obvious
> one.
>

Improving allocation latency in itself would be a separate effort. When
memory is full or fragmented, we have to deal with reclaim or compaction to
make allocations (especially higher-order ones) succeed, and forcing
compaction to happen only on-demand is, in my opinion, not the right
approach. As I detailed above, at the very minimum we need an interface like
`compact_extfrag`, which leaves the decision of how proactive compaction
should be to specific kernel/userspace drivers.
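As promised above, here is a rough sketch of the fragmentation program
(illustrative only: the 24G size, the hard-coded 4K base page size, the
assumed 2M alignment of the mapping and the missing error handling are all
simplifications of the real test):

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL << 20)
#define BASE_PAGE	4096UL

int main(void)
{
	size_t len = 24UL << 30;	/* sized to cover most of free RAM */
	size_t off, pg;
	char *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 1, len);		/* fault every base page in */

	/* within each 2M chunk, munmap 3 out of every 4 base pages */
	for (off = 0; off < len; off += HPAGE_SIZE)
		for (pg = 0; pg < HPAGE_SIZE; pg += 4 * BASE_PAGE)
			munmap(buf + off + pg, 3 * BASE_PAGE);

	pause();	/* keep the remaining 1/4 mapped while measuring */
	return 0;
}

This leaves plenty of free memory, but almost none of it in contiguous 2M
blocks, and all of it backed by movable (anonymous) pages.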