On Mon, 2019-08-26 at 12:47 +0100, Mel Gorman wrote:
> On Thu, Aug 22, 2019 at 09:57:22PM +0000, Nitin Gupta wrote:
> > > Note that proactive compaction may reduce allocation latency but it is
> > > not free either. Even though the scanning and migration may happen in a
> > > kernel thread, tasks can incur faults while waiting for compaction to
> > > complete if the task accesses data being migrated. This means that costs
> > > are incurred by applications on a system that may never care about
> > > high-order allocation latency -- particularly if the allocations
> > > typically happen at application initialisation time. I recognise that
> > > kcompactd makes a bit of effort to compact memory out-of-band but it
> > > also is typically triggered in response to reclaim that was triggered by
> > > a high-order allocation request. i.e. the work done by the thread is
> > > triggered by an allocation request that hit the slow paths and not a
> > > preemptive measure.
> > >
> >
> > Hitting the slow path for every higher-order allocation is a significant
> > performance/latency issue for applications that require a large number of
> > these allocations to succeed in bursts. To get some concrete numbers, I
> > made a small driver that allocates as many hugepages as possible and
> > measures allocation latency:
> >
>
> Every higher-order allocation does not necessarily hit the slow path nor
> does it incur equal latency.

I did not mean *every* hugepage allocation in a literal sense. I meant to
say: higher-order allocations *tend* to hit the slow path with high
probability under a reasonably fragmented memory state, and when they do,
they incur high latency.

> > The driver first tries to allocate a hugepage using GFP_TRANSHUGE_LIGHT
> > (referred to as "Light" in the table below) and if that fails, tries to
> > allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
> > "Fallback" in the table below). We stop the allocation loop if both
> > methods fail.
> >
> > Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All
> > latencies are in microseconds.
> >
> > > GFP/Stat  |       Any |   Light |  Fallback |
> > > --------: | --------: | ------: | --------: |
> > > count     |      9908 |     788 |      9120 |
> > > min       |       0.0 |     0.0 |    1726.0 |
> > > max       |  135387.0 |   142.0 |  135387.0 |
> > > mean      |   5494.66 |    1.83 |   5969.26 |
> > > stddev    |  21624.04 |    7.58 |  22476.06 |
> >
> Given that it is expected that there would be significant tail latencies,
> it would be better to analyse this in terms of percentiles. A very small
> number of high latency allocations would skew the mean significantly
> which is hinted by the stddev.
>

Here is the same data in terms of percentiles (latencies in microseconds):

- with vanilla kernel 5.3.0-rc5:

  percentile   latency
  ----------   -------
           5         1
          10      1790
          25      1829
          30      1838
          40      1854
          50      1871
          60      1890
          75      1924
          80      1945
          90      2206
          95      2302

- now with kernel 5.3.0-rc5 + this patch:

  percentile   latency
  ----------   -------
           5         3
          10         4
          25         4
          30         4
          40         4
          50         4
          60         4
          75         5
          80         5
          90         9
          95      1154

> > As you can see, the mean and stddev of allocation latency are extremely
> > high with the current approach of on-demand compaction.
> >
> > The system was fragmented from a userspace program as I described in this
> > patch description. The workload is mainly anonymous userspace pages,
> > which are easy to move around.
> > I intentionally avoided unmovable pages in this test to see how much
> > latency we incur just by hitting the slow path for a majority of
> > allocations.
> >
>
> Even so, the penalty for proactive compaction is that applications that
> may have no interest in higher-order pages may still stall while their
> data is migrated if the data is hot. This is why I think the focus should
> be on reducing the latency of compaction -- it benefits applications that
> require higher-order pages without increasing the overhead for unrelated
> applications.
>

Sure, reducing compaction latency would help, but there should still be an
option to proactively compact to hide latencies further.

> > > > For a more proactive compaction, the approach taken here is to define
> > > > per page-order external fragmentation thresholds and let kcompactd
> > > > threads act on these thresholds.
> > > >
> > > > The low and high thresholds are defined per page-order and exposed
> > > > through sysfs:
> > > >
> > > > /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> > > >
> > >
> > > These will be difficult to tune for an admin that is not extremely
> > > familiar with how external fragmentation is defined. If an admin asked
> > > "how much will stalls be reduced by setting this to a different value?",
> > > the answer will always be "I don't know, maybe some, maybe not".
> > >
> >
> > Yes, this is my main worry. These values can be set to empirically
> > determined values on highly specialized systems like database appliances.
> > However, on a generic system, there is no real reasonable value.
> >
>
> Yep, which means the tunable will be vulnerable to cargo-cult tuning
> recommendations. Or worse, the tuning recommendation will be a flat
> "disable THP".
>

I thought more on this and yes, exposing a system-wide per-order extfrag
threshold may not be the best approach. Instead, expose a specific
interface to compact a zone to a specified level and leave the policy on
when to trigger (based on extfrag levels, system load etc.) up to the user
(a kernel driver or a userspace daemon).

> > Still, at the very least, I would like an interface that allows
> > compacting the system to a reasonable state. Something like:
> >
> > compact_extfrag(node, zone, order, high, low)
> >
> > which starts compaction if extfrag > high, and goes on till extfrag < low.
> >
> > It's possible that there are too many unmovable pages mixed around for
> > compaction to succeed; still, it's a reasonable interface to expose
> > rather than the forced, on-demand style of compaction (please see data
> > below).
> >
> > How (and if) to expose it to userspace (sysfs etc.) can be a separate
> > discussion.
> >
>
> That would be functionally similar to vm.compact_memory although it would
> either need an extension or a separate tunable. With sysfs, there could
> be a per-node file that takes a watermark and order tuple to trigger the
> interface.
>

Something like:

  /sys/kernel/mm/node-n/compact

or:

  /sys/kernel/mm/compact-n

where n is in [0, NUM_NODES] and the file takes a (watermark, order)
tuple. Would that do?

I'm also okay with not adding any of these sysfs interfaces for now.
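To make the "policy outside the kernel" idea above concrete, here is a
rough userspace sketch. It is not part of this patch and only an
illustration: it computes a per-order extfrag value from /proc/buddyinfo
and, when that crosses a "high" threshold, pokes the existing
vm.compact_memory sysctl as a stand-in for the proposed compact_extfrag().
The extfrag formula, the order-9 target, and the 30/25 thresholds are
assumptions chosen for the example, not values from the patch.

/* Sketch of a userspace proactive-compaction policy daemon (needs root). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_ORDER 11

/*
 * Assumed definition for illustration:
 * extfrag(order) = 100 * (free pages in blocks smaller than order)
 *                      / (all free pages)
 */
static int extfrag_for_order(int order)
{
        FILE *f = fopen("/proc/buddyinfo", "r");
        char line[512];
        unsigned long free_total = 0, free_suitable = 0;

        if (!f)
                return -1;

        while (fgets(line, sizeof(line), f)) {
                unsigned long nr[MAX_ORDER] = {0};
                char *p = strstr(line, "zone");
                int o;

                if (!p)
                        continue;
                /* skip "zone" and the zone name, then read one count per order */
                p += strlen("zone");
                while (*p == ' ')
                        p++;
                p = strchr(p, ' ');
                for (o = 0; p && o < MAX_ORDER; o++)
                        nr[o] = strtoul(p, &p, 10);

                for (o = 0; o < MAX_ORDER; o++) {
                        unsigned long pages = nr[o] << o;

                        free_total += pages;
                        if (o >= order)
                                free_suitable += pages;
                }
        }
        fclose(f);

        if (!free_total)
                return 0;
        return (int)(100 * (free_total - free_suitable) / free_total);
}

int main(void)
{
        /* hypothetical thresholds, in the spirit of extfrag_{high,low} */
        const int high = 30, low = 25, order = 9;

        for (;;) {
                if (extfrag_for_order(order) > high) {
                        /* existing one-shot "compact everything" trigger */
                        FILE *c = fopen("/proc/sys/vm/compact_memory", "w");

                        if (c) {
                                fputs("1\n", c);
                                fclose(c);
                        }
                        /* a real compact_extfrag() would stop once extfrag < low */
                        (void)low;
                }
                sleep(5);
        }
        return 0;
}

A real compact_extfrag() would of course live in the kernel, work
zone-by-zone, and stop once extfrag drops below the low threshold rather
than compacting everything; the sketch is only meant to show where the
triggering policy could sit.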
> > > > Per-node kcompactd thread is woken up every few seconds to check if
> > > > any zone on its node has extfrag above the extfrag_high threshold for
> > > > any order, in which case the thread starts compaction in the
> > > > background till all zones are below the extfrag_low level for all
> > > > orders. By default both these thresholds are set to 100 for all
> > > > orders, which essentially disables kcompactd.
> > > >
> > > > To avoid wasting CPU cycles when compaction cannot help, such as when
> > > > memory is full, we check both extfrag > extfrag_high and
> > > > compaction_suitable(zone). This allows the kcompactd thread to stay
> > > > inactive even if extfrag thresholds are not met.
> > > >
> > >
> > > There is still a risk that if a system is completely fragmented that it
> > > may consume CPU on pointless compaction cycles. This is why compaction
> > > from kernel thread context makes no special effort and bails relatively
> > > quickly and assumes that if an application really needs high-order
> > > pages that it'll incur the cost at allocation time.
> > >
> >
> > As the data in Table-1 shows, on-demand compaction can add high latency
> > to every single allocation. I think it would be a significant improvement
> > (see Table-2) to at least expose an interface to allow proactive
> > compaction (like compact_extfrag), which a driver can itself run in the
> > background. This way, we need not add any tunables to the kernel itself
> > and can leave the compaction decision to specialized kernel/userspace
> > monitors.
> >
>
> I do not have any major objection -- again, it's not that dissimilar to
> compact_memory (although that was intended as a debugging interface).
>

Yes, the only difference is that I want compaction to stop once we hit the
given extfrag level.

> > > > This patch is largely based on ideas from Michal Hocko posted here:
> > > > https://lore.kernel.org/linux-mm/20161230131412.GI13301@xxxxxxxxxxxxxx/
> > > >
> > > > Testing done (on x86):
> > > > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > > >   respectively.
> > > > - Use a test program to fragment memory: the program allocates all
> > > >   memory and then, for each 2M aligned section, frees 3/4 of base
> > > >   pages using munmap.
> > > > - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> > > >   starts compaction till extfrag < extfrag_low for order-9.
> > > >
> > >
> > > This is a somewhat optimistic allocation scenario. The interesting ones
> > > are when a system is fragmented in a manner that is not trivial to
> > > resolve -- e.g. after a prolonged period of time with
> > > unmovable/reclaimable allocations stealing pageblocks. It's also fairly
> > > difficult to analyse if this is helping because you cannot measure
> > > after the fact how much time was saved in allocation time due to the
> > > work done by kcompactd. It is also hard to determine if the sum of the
> > > stalls incurred by proactive compaction is lower than the time saved at
> > > allocation time.
> > >
> > > I fear that the user-visible effect will be times when there are very
> > > short but numerous stalls due to proactive compaction running in the
> > > background that will be hard to detect while the benefits may be
> > > invisible.
> > >
> >
> > Pro-active compaction can be done in a non-time-critical context, so to
> > estimate its benefits we can just compare the data from Table-1 with the
> > same run, under a similar fragmentation state, but with this patch
> > applied:
> >
>
> How do you define what a non-time-critical context is? Once compaction
> starts, an application's data becomes temporarily unavailable during
> migration.

By a time-critical context I roughly mean contexts where hugepage
allocations are triggered in response to a user action and any delay here
would be directly noticeable by the user. Compare this scenario with a
background thread doing compaction: this activity can appear as random
freezes for running applications. Whether this effect on unrelated
applications is acceptable or not can be left to the user of this new
compaction interface.

> > Table-2: hugepage allocation latencies with this patch applied on
> > 5.3.0-rc5. All latencies are in microseconds.
> >
> > > GFP/Stat  |       Any |    Light |  Fallback |
> > > --------: | --------: | -------: | --------: |
> > > count     |   12197.0 |  11167.0 |    1030.0 |
> > > min       |       2.0 |      2.0 |       5.0 |
> > > max       |  361727.0 |     26.0 |  361727.0 |
> > > mean      |    366.05 |     4.48 |   4286.13 |
> > > stddev    |   4575.53 |     1.41 |  15209.63 |
> >
> > We can see that the mean latency dropped to 366us compared with 5494us
> > before.
> >
> > This is an optimistic scenario where there was only a small mix of
> > unmovable pages, but the data still shows that when compaction can
> > succeed, pro-active compaction can give a significant reduction in
> > higher-order allocation latencies.
> >
>
> Which still does not address the point that reducing compaction overhead
> is generally beneficial without incurring additional overhead to
> unrelated applications.
>

Yes, reducing compaction latency is always beneficial, especially if it
can be done in a way that does not touch (hot) pages from unrelated
applications. Even with good improvements in this area, proactive
compaction would still be good to have.

> I'm not against the use of an interface because it requires an
> application to make a deliberate choice and understand the downsides
> which can be documented. An automatic proactive compaction may impact
> users that have no idea the feature even exists.
>

I'm now dropping the idea of exposing per-order extfrag thresholds and
would instead focus on an interface to compact memory to reach a given
extfrag level.

Thanks,
Nitin