On Mon, 2019-08-26 at 12:47 +0100, Mel Gorman wrote:
> On Thu, Aug 22, 2019 at 09:57:22PM +0000, Nitin Gupta wrote:
> > > Note that proactive compaction may reduce allocation latency but it is
> > > not free either. Even though the scanning and migration may happen in a
> > > kernel thread, tasks can incur faults while waiting for compaction to
> > > complete if the task accesses data being migrated. This means that costs
> > > are incurred by applications on a system that may never care about
> > > high-order allocation latency -- particularly if the allocations
> > > typically happen at application initialisation time. I recognise that
> > > kcompactd makes a bit of effort to compact memory out-of-band but it
> > > also is typically triggered in response to reclaim that was triggered by
> > > a high-order allocation request. i.e. the work done by the thread is
> > > triggered by an allocation request that hit the slow paths and not a
> > > preemptive measure.
> > >
> >
> > Hitting the slow path for every higher-order allocation is a significant
> > performance/latency issue for applications that require a large number of
> > these allocations to succeed in bursts. To get some concrete numbers, I
> > made a small driver that allocates as many hugepages as possible and
> > measures allocation latency:
> >
>
> Every higher-order allocation does not necessarily hit the slow path nor
> does it incur equal latency.

I did not mean *every* hugepage allocation in a literal sense. I meant to
say: higher-order allocations *tend* to hit the slow path with high
probability under a reasonably fragmented memory state, and when they do,
they incur high latency.

> > The driver first tries to allocate a hugepage using GFP_TRANSHUGE_LIGHT
> > (referred to as "Light" in the table below) and if that fails, tries to
> > allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
> > "Fallback" in the table below). We stop the allocation loop if both
> > methods fail.
> >
> > Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All
> > latencies are in microseconds.
> >
> > > GFP/Stat  |       Any |   Light |  Fallback |
> > > --------: | --------: | ------: | --------: |
> > > count     |      9908 |     788 |      9120 |
> > > min       |       0.0 |     0.0 |    1726.0 |
> > > max       |  135387.0 |   142.0 |  135387.0 |
> > > mean      |   5494.66 |    1.83 |   5969.26 |
> > > stddev    |  21624.04 |    7.58 |  22476.06 |
> >
> Given that it is expected that there would be significant tail latencies,
> it would be better to analyse this in terms of percentiles. A very small
> number of high latency allocations would skew the mean significantly
> which is hinted by the stddev.
>

Here is the same data in terms of percentiles (latencies in microseconds):

- with vanilla kernel 5.3.0-rc5:

  percentile   latency
  ----------   -------
           5         1
          10      1790
          25      1829
          30      1838
          40      1854
          50      1871
          60      1890
          75      1924
          80      1945
          90      2206
          95      2302

- now with kernel 5.3.0-rc5 + this patch:

  percentile   latency
  ----------   -------
           5         3
          10         4
          25         4
          30         4
          40         4
          50         4
          60         4
          75         5
          80         5
          90         9
          95      1154

> > As you can see, the mean and stddev of allocation latency are extremely
> > high with the current approach of on-demand compaction.
> >
> > The system was fragmented from a userspace program as I described in this
> > patch description. The workload is mainly anonymous userspace pages,
> > which are easy to move around.
> > I intentionally avoided unmovable pages in this test to see how much
> > latency we incur just by hitting the slow path for a majority of
> > allocations.
> >
>
> Even so, the penalty for proactive compaction is that applications that
> may have no interest in higher-order pages may still stall while their
> data is migrated if the data is hot. This is why I think the focus should
> be on reducing the latency of compaction -- it benefits applications that
> require higher-order pages without increasing the overhead for unrelated
> applications.
>

Sure, reducing compaction latency would help, but there should still be an
option to proactively compact to hide latencies further.

> > > > For a more proactive compaction, the approach taken here is to define
> > > > per page-order external fragmentation thresholds and let kcompactd
> > > > threads act on these thresholds.
> > > >
> > > > The low and high thresholds are defined per page-order and exposed
> > > > through sysfs:
> > > >
> > > > /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> > > >
> > >
> > > These will be difficult to tune for an admin that is not extremely
> > > familiar with how external fragmentation is defined. If an admin asked
> > > "how much will stalls be reduced by setting this to a different value?",
> > > the answer will always be "I don't know, maybe some, maybe not".
> > >
> >
> > Yes, this is my main worry. These values can be set to empirically
> > determined values on highly specialized systems like database appliances.
> > However, on a generic system, there is no real reasonable value.
> >
>
> Yep, which means the tunable will be vulnerable to cargo-cult tuning
> recommendations. Or worse, the tuning recommendation will be a flat
> "disable THP".
>

I thought more on this and yes, exposing a system-wide per-order extfrag
threshold may not be the best approach. Instead, expose a specific
interface to compact a zone to a specified level and leave the policy on
when to trigger (based on extfrag levels, system load etc.) up to the user
(a kernel driver or a userspace daemon).

> > Still, at the very least, I would like an interface that allows
> > compacting the system to a reasonable state. Something like:
> >
> > compact_extfrag(node, zone, order, high, low)
> >
> > which starts compaction if extfrag > high, and goes on till extfrag < low.
> >
> > It's possible that there are too many unmovable pages mixed around for
> > compaction to succeed; still, it's a reasonable interface to expose
> > rather than the forced, on-demand style of compaction (please see data
> > below).
> >
> > How (and if) to expose it to userspace (sysfs etc.) can be a separate
> > discussion.
> >
>
> That would be functionally similar to vm.compact_memory although it would
> either need an extension or a separate tunable. With sysfs, there could
> be a per-node file that takes a watermark and order tuple to trigger the
> interface.
>

Something like:

  /sys/kernel/mm/node-n/compact

or:

  /sys/kernel/mm/compact-n

where n is in [0, NUM_NODES] and the file takes a (watermark, order)
tuple. Would that do?

I'm also okay with not adding any of these sysfs interfaces for now.
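To make the "policy outside the kernel" idea above concrete, here is a
rough userspace sketch. It is not part of this patch and only an
illustration: it computes a per-order extfrag value from /proc/buddyinfo
and, when that crosses a "high" threshold, pokes the existing
vm.compact_memory sysctl as a stand-in for the proposed compact_extfrag().
The extfrag formula, the order-9 target, and the 30/25 thresholds are
assumptions chosen for the example, not values from the patch.

/* Sketch of a userspace proactive-compaction policy daemon (needs root). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_ORDER 11

/*
 * Assumed definition for illustration:
 * extfrag(order) = 100 * (free pages in blocks smaller than order)
 *                      / (all free pages)
 */
static int extfrag_for_order(int order)
{
        FILE *f = fopen("/proc/buddyinfo", "r");
        char line[512];
        unsigned long free_total = 0, free_suitable = 0;

        if (!f)
                return -1;

        while (fgets(line, sizeof(line), f)) {
                unsigned long nr[MAX_ORDER] = {0};
                char *p = strstr(line, "zone");
                int o;

                if (!p)
                        continue;
                /* skip "zone" and the zone name, then read one count per order */
                p += strlen("zone");
                while (*p == ' ')
                        p++;
                p = strchr(p, ' ');
                for (o = 0; p && o < MAX_ORDER; o++)
                        nr[o] = strtoul(p, &p, 10);

                for (o = 0; o < MAX_ORDER; o++) {
                        unsigned long pages = nr[o] << o;

                        free_total += pages;
                        if (o >= order)
                                free_suitable += pages;
                }
        }
        fclose(f);

        if (!free_total)
                return 0;
        return (int)(100 * (free_total - free_suitable) / free_total);
}

int main(void)
{
        /* hypothetical thresholds, in the spirit of extfrag_{high,low} */
        const int high = 30, low = 25, order = 9;

        for (;;) {
                if (extfrag_for_order(order) > high) {
                        /* existing one-shot "compact everything" trigger */
                        FILE *c = fopen("/proc/sys/vm/compact_memory", "w");

                        if (c) {
                                fputs("1\n", c);
                                fclose(c);
                        }
                        /* a real compact_extfrag() would stop once extfrag < low */
                        (void)low;
                }
                sleep(5);
        }
        return 0;
}

A real compact_extfrag() would of course live in the kernel, work
zone-by-zone, and stop once extfrag drops below the low threshold rather
than compacting everything; the sketch is only meant to show where the
triggering policy could sit.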
> > > > Per-node kcompactd thread is woken up every few seconds to check if
> > > > any zone on its node has extfrag above the extfrag_high threshold for
> > > > any order, in which case the thread starts compaction in the
> > > > background till all zones are below the extfrag_low level for all
> > > > orders. By default both these thresholds are set to 100 for all
> > > > orders, which essentially disables kcompactd.
> > > >
> > > > To avoid wasting CPU cycles when compaction cannot help, such as when
> > > > memory is full, we check both extfrag > extfrag_high and
> > > > compaction_suitable(zone). This allows the kcompactd thread to stay
> > > > inactive even if extfrag thresholds are not met.
> > > >
> > >
> > > There is still a risk that if a system is completely fragmented that it
> > > may consume CPU on pointless compaction cycles. This is why compaction
> > > from kernel thread context makes no special effort and bails relatively
> > > quickly and assumes that if an application really needs high-order
> > > pages that it'll incur the cost at allocation time.
> > >
> >
> > As the data in Table-1 shows, on-demand compaction can add high latency
> > to every single allocation. I think it would be a significant improvement
> > (see Table-2) to at least expose an interface to allow proactive
> > compaction (like compact_extfrag), which a driver can itself run in the
> > background. This way, we need not add any tunables to the kernel itself
> > and can leave the compaction decision to specialized kernel/userspace
> > monitors.
> >
>
> I do not have any major objection -- again, it's not that dissimilar to
> compact_memory (although that was intended as a debugging interface).
>

Yes, the only difference is that I want compaction to stop once we hit the
given extfrag level.

> > > > This patch is largely based on ideas from Michal Hocko posted here:
> > > > https://lore.kernel.org/linux-mm/20161230131412.GI13301@xxxxxxxxxxxxxx/
> > > >
> > > > Testing done (on x86):
> > > > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > > >   respectively.
> > > > - Use a test program to fragment memory: the program allocates all
> > > >   memory and then, for each 2M aligned section, frees 3/4 of base
> > > >   pages using munmap.
> > > > - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> > > >   starts compaction till extfrag < extfrag_low for order-9.
> > > >
> > >
> > > This is a somewhat optimistic allocation scenario. The interesting ones
> > > are when a system is fragmented in a manner that is not trivial to
> > > resolve -- e.g. after a prolonged period of time with
> > > unmovable/reclaimable allocations stealing pageblocks. It's also fairly
> > > difficult to analyse if this is helping because you cannot measure
> > > after the fact how much time was saved in allocation time due to the
> > > work done by kcompactd. It is also hard to determine if the sum of the
> > > stalls incurred by proactive compaction is lower than the time saved at
> > > allocation time.
> > >
> > > I fear that the user-visible effect will be times when there are very
> > > short but numerous stalls due to proactive compaction running in the
> > > background that will be hard to detect while the benefits may be
> > > invisible.
> > >
> >
> > Pro-active compaction can be done in a non-time-critical context, so to
> > estimate its benefits we can just compare the data from Table-1 with the
> > same run, under a similar fragmentation state, but with this patch
> > applied:
> >
>
> How do you define what a non-time-critical context is? Once compaction
> starts, an application's data becomes temporarily unavailable during
> migration.

By a time-critical context I roughly mean contexts where hugepage
allocations are triggered in response to a user action and any delay here
would be directly noticeable by the user. Compare this scenario with a
background thread doing compaction: this activity can appear as random
freezes for running applications. Whether this effect on unrelated
applications is acceptable or not can be left to the user of this new
compaction interface.

> > Table-2: hugepage allocation latencies with this patch applied on
> > 5.3.0-rc5. All latencies are in microseconds.
> >
> > > GFP/Stat  |       Any |    Light |  Fallback |
> > > --------: | --------: | -------: | --------: |
> > > count     |   12197.0 |  11167.0 |    1030.0 |
> > > min       |       2.0 |      2.0 |       5.0 |
> > > max       |  361727.0 |     26.0 |  361727.0 |
> > > mean      |    366.05 |     4.48 |   4286.13 |
> > > stddev    |   4575.53 |     1.41 |  15209.63 |
> >
> > We can see that the mean latency dropped to 366us compared with 5494us
> > before.
> >
> > This is an optimistic scenario where there was only a small mix of
> > unmovable pages, but the data still shows that when compaction can
> > succeed, pro-active compaction can give a significant reduction in
> > higher-order allocation latencies.
> >
>
> Which still does not address the point that reducing compaction overhead
> is generally beneficial without incurring additional overhead to
> unrelated applications.
>

Yes, reducing compaction latency is always beneficial, especially if it
can be done in a way that does not touch (hot) pages from unrelated
applications. Even with good improvements in this area, proactive
compaction would still be good to have.

> I'm not against the use of an interface because it requires an
> application to make a deliberate choice and understand the downsides
> which can be documented. An automatic proactive compaction may impact
> users that have no idea the feature even exists.
>

I'm now dropping the idea of exposing per-order extfrag thresholds and
would instead focus on an interface to compact memory to reach a given
extfrag level.

Thanks,
Nitin