[RFC PATCH 0/2] Add predictive memory reclamation and compaction

Khalid Aziz <khalid.aziz@xxxxxxxxxx> · Mon, 12 Aug 2019 19:40:10 -0600

Page reclamation and compaction is triggered in response to reaching low
watermark. This makes reclamation/compaction reactive based upon a
snapshot of the system at a point in time. When that point is reached,
system is already suffering from free memory shortage and must now try
to recover. Recovery can often land system in direct
reclamation/compaction path and while recovery happens, workloads start
to experience unpredictable memory allocation latencies. In real life,
forced direct reclamation has been seen to cause sudden spike in time it
takes to populate a new database or an extraordinary unpredictable
latency in launching a new server on cloud platform. These events create
SLA violations which are expensive for businesses.

If the kernel could foresee a potential free page exhaustion or
fragmentation event well before it happens, it could start reclamation
proactively instead to avoid allocation stalls. A time based trend line
for available free pages can show such potential future events by
charting the current memory consumption trend on the system.

These patches propose a way to capture enough memory usage information
to compute a trend line based upon most recent data. Trend line is
graphed with x-axis showing time and y-axis showing number of free
pages. The proposal is to capture the number of free pages at opportune
moments along with the current timestamp. Once system has enough data
points (the lookback window for trend analysis), fit a line of the form
y=mx+c to these points using least sqaure regression method.  As time
advances, these points can be updated with new data points and a new
best fit line can be computed. Capturing these data points and computing
trend line for pages of order 0-MAX_ORDER allows us to not only foresee
free pages exhaustion point but also severe fragmentation points in
future.

If the line representing trend for total free pages has a negative slope
(hence trending downward), solving y=mx+c for x with y=0 tells us if
the current trend continues, at what point would the system run out of
free pages. If average rate of page reclamation is computed by observing
page reclamation behavior, that information can be used to compute the
time to start reclamation at so that number of free pages does not fall
to 0 or below low watermark if current memory consumption trend were to
continue.

Similarly, if kernel tracks the level of fragmentation for each order
page (which can be done by computing the number of free pages below this
order), a trend line for each order can be used to compute the point in
time when no more pages of that order will be available for allocation.
If the trend line represents number of unusable pages for that order,
the intersection of this line with line representing number of free
pages is the point of 100% fragmentation. This holds true because at
this intersection point all free pages are of lower order. Intersetion
point for two lines y0=m0x0+c0 and y1=m1x1+c1 can be computed
mathematically which yields x and y coordinates on time and free pages
graph. If average rate of compaction is computed by timing previous
compaction runs, kernel can compute how soon does it need to start
compaction to avoid this 100% fragmentation point.

Patch 1 adds code to maintain a sliding lookback window of (time, number
of free pages) points which can be updated continuously and adds code to
compute best fit line across these points. It also adds code to use the
best fit lines to determine if kernel must start reclamation or
compaction.

Patch 2 adds code to collect data points on free pages of various orders
at different points in time, uses code in patch 1 to update sliding
lookback window with these points and kicks off reclamation or
compaction based upon the results it gets.

Patch 1 maintains a fixed size lookback window. A fixed size lookback
window limits the amount of data that has to be maintained to compute a
best fit line. Routine mem_predict() in patch 1 uses best fit line to
determine the immediate need for reclamation or compaction. To simplify
initial concept implementation, it uses a fixed time threshold when
compaction should start in anticipation of impending fragmentation.
Similarly it uses a fixed minimum precentage free pages as criteria to
detrmine if it is time to start reclamation if the current trend line
shows continued drop in number of free pages. Both of these criteria can
be improved upon in final implementation by taking rate of compaction
and rate of reclamation into account.

Patch 2 collects data points for best fit line in kswapd before we
decide if kswapd should go to sleep or continue reclamation. It then
uses that data to delay kswapd from sleeping and continue reclamation.
Potential fragmentation information obtained from best fit line is used
to decide if zone watermark should be boosted to avert impending
fragmentation. This data is also used in balance_pgdat() to determine if
kcompatcd should be woken up to start compaction.
get_page_from_freelist() might be a better place to gather data points
and make decision on starting reclamation or comapction but it can also
impact page allocation latency. Another possibility is to create a
separate kernel thread that gathers page usage data periodically and
wakes up kswapd or kcompactd as needed based upon trend analysis. This
is something that can be finalized before final implementation of this
proposal.

Impact of this implementation was measured using two sets of tests.
First test consists of three concurrent dd processes writing large
amounts of data (66 GB, 131 GB and 262 GB) to three different SSDs
causing large number of free pages to be used up for buffer/page cache.
Number of cumulative allocation stalls as reported by /proc/vmstat were
recorded for 5 runs of this test.

5.3-rc2
-------

allocstall_dma 0
allocstall_dma32 0
allocstall_normal 15
allocstall_movable 1629
compact_stall 0

Total = 1644

5.3-rc2 + this patch series
---------------------------

allocstall_dma 0
allocstall_dma32 0
allocstall_normal 182
allocstall_movable 1266
compact_stall 0

Total = 1544

There was no significant change in system time between these runs. This
was a ~6.5% improvement in number of allocation stalls.

A scond test used was the parallel dd test from mmtests. Average number
of stalls over 4 runs with unpatched 5.3-rc2 kernel was 6057. Average
number of stalls over 4 runs after applying these patches was 5584. This
was an ~8% improvement in number of allocation stalls.

This work is complementary to other allocation/compaction stall
improvements. It attempts to address potential stalls proactively before
they happen and will make use of any improvements made to the
reclamation/compaction code.

Any feedback on this proposal and associated implementation will be
greatly appreciated. This is work in progress.

Khalid Aziz (2):
  mm: Add trend based prediction algorithm for memory usage
  mm/vmscan: Add fragmentation prediction to kswapd

 include/linux/mmzone.h |  72 +++++++++++
 mm/Makefile            |   2 +-
 mm/lsq.c               | 273 +++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |  27 ----
 mm/vmscan.c            | 116 ++++++++++++++++-
 5 files changed, 456 insertions(+), 34 deletions(-)
 create mode 100644 mm/lsq.c

-- 
2.20.1