Page reclamation and compaction are triggered when free memory reaches the low watermark. This makes reclamation/compaction reactive, based upon a snapshot of the system at a point in time. When that point is reached, the system is already suffering from a free memory shortage and must now try to recover. Recovery can often land the system in the direct reclamation/compaction path, and while recovery happens, workloads start to experience unpredictable memory allocation latencies. In real life, forced direct reclamation has been seen to cause a sudden spike in the time it takes to populate a new database, or an extraordinary and unpredictable latency in launching a new server on a cloud platform. These events create SLA violations which are expensive for businesses.

If the kernel could foresee a potential free page exhaustion or fragmentation event well before it happens, it could start reclamation proactively instead to avoid allocation stalls. A time based trend line for available free pages can show such potential future events by charting the current memory consumption trend on the system.

These patches propose a way to capture enough memory usage information to compute a trend line based upon the most recent data. The trend line is graphed with the x-axis showing time and the y-axis showing the number of free pages. The proposal is to capture the number of free pages at opportune moments along with the current timestamp. Once the system has enough data points (the lookback window for trend analysis), fit a line of the form y=mx+c to these points using the least squares regression method. As time advances, these points can be updated with new data points and a new best fit line can be computed. Capturing these data points and computing a trend line for pages of order 0-MAX_ORDER allows us to foresee not only the free page exhaustion point but also severe fragmentation points in the future.
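The least squares fit described above can be sketched in user-space C as follows. This is only an illustration of the regression step, not the code in mm/lsq.c: the names trend_point, trend_line and fit_trend are hypothetical, and floating point is used for readability, whereas in-kernel code would normally use integer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the identifiers used in the patches. */
struct trend_point {
	double t;          /* timestamp of the sample */
	double free_pages; /* number of free pages observed at time t */
};

struct trend_line {
	double m; /* slope: change in free pages per unit of time */
	double c; /* intercept: free pages at t = 0 */
};

/*
 * Fit y = m*x + c to n points by ordinary least squares.
 * Returns 0 on success, or -1 if there are fewer than two points or
 * all timestamps are identical (the slope is undefined in that case).
 */
static int fit_trend(const struct trend_point *p, size_t n,
		     struct trend_line *out)
{
	double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
	double cnt = (double)n;
	double denom;
	size_t i;

	assert(p != NULL && out != NULL);
	if (n < 2)
		return -1;

	/* Accumulate the sums needed by the closed-form solution. */
	for (i = 0; i < n; i++) {
		sx += p[i].t;
		sy += p[i].free_pages;
		sxx += p[i].t * p[i].t;
		sxy += p[i].t * p[i].free_pages;
	}

	denom = cnt * sxx - sx * sx;
	if (denom == 0.0)
		return -1;

	out->m = (cnt * sxy - sx * sy) / denom;
	out->c = (sy - out->m * sx) / cnt;
	return 0;
}
```

For the samples (0, 1000), (1, 900), (2, 800), (3, 700), this yields m = -100 and c = 1000, i.e. free pages dropping by 100 per unit of time. A sliding window only needs to replace the oldest sample and refit.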
If the line representing the trend for total free pages has a negative slope (hence trending downward), solving y=mx+c for x with y=0 (that is, x=-c/m) tells us at what point the system would run out of free pages if the current trend continues. If the average rate of page reclamation is computed by observing page reclamation behavior, that information can be used to compute when to start reclamation so that the number of free pages does not fall to 0, or below the low watermark, if the current memory consumption trend were to continue.

Similarly, if the kernel tracks the level of fragmentation for each page order (which can be done by computing the number of free pages below this order), a trend line for each order can be used to compute the point in time when no more pages of that order will be available for allocation. If the trend line represents the number of unusable pages for that order, the intersection of this line with the line representing the number of free pages is the point of 100% fragmentation. This holds true because at this intersection point all free pages are of lower order. The intersection point of two lines y=m0x+c0 and y=m1x+c1 can be computed mathematically (x=(c1-c0)/(m0-m1)), which yields x and y coordinates on the time and free pages graph. If the average rate of compaction is computed by timing previous compaction runs, the kernel can compute how soon it needs to start compaction to avoid this 100% fragmentation point.

Patch 1 adds code to maintain a sliding lookback window of (time, number of free pages) points which can be updated continuously, and adds code to compute the best fit line across these points. It also adds code to use the best fit lines to determine if the kernel must start reclamation or compaction.

Patch 2 adds code to collect data points on free pages of various orders at different points in time, uses the code in patch 1 to update the sliding lookback window with these points, and kicks off reclamation or compaction based upon the results it gets.

Patch 1 maintains a fixed size lookback window.
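The exhaustion and 100% fragmentation calculations described above reduce to two small pieces of line arithmetic. A minimal user-space sketch follows; the names trend_line, time_to_level and intersect are hypothetical and do not come from the patches, and floating point is used only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type for illustration: a fitted line y = m*x + c. */
struct trend_line {
	double m; /* slope */
	double c; /* intercept */
};

/*
 * Compute the time at which a falling trend line reaches `level`
 * (0 for total exhaustion, or e.g. a watermark).
 * Returns 0 and stores the time in *t, or -1 if the line is flat or
 * rising and therefore never falls to that level. The result may lie
 * in the past if the line is already below `level`.
 */
static int time_to_level(const struct trend_line *l, double level, double *t)
{
	assert(l != NULL && t != NULL);
	if (l->m >= 0.0)
		return -1;
	*t = (level - l->c) / l->m;
	return 0;
}

/*
 * Intersect two lines y = m0*x + c0 and y = m1*x + c1.
 * Setting them equal gives x = (c1 - c0) / (m0 - m1).
 * Returns 0 and stores the point in (*x, *y), or -1 if the lines are
 * parallel and have no single intersection point.
 */
static int intersect(const struct trend_line *a, const struct trend_line *b,
		     double *x, double *y)
{
	double dm;

	assert(a != NULL && b != NULL && x != NULL && y != NULL);
	dm = a->m - b->m;
	if (dm == 0.0)
		return -1;
	*x = (b->c - a->c) / dm;
	*y = a->m * (*x) + a->c;
	return 0;
}
```

As an example, a free page line with m = -100, c = 1000 reaches 0 at x = 10. If the unusable page line for some order is m = 50, c = 100, the two lines meet at x = 6, y = 400, which would be the predicted 100% fragmentation point for that order.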
A fixed size lookback window limits the amount of data that has to be maintained to compute a best fit line. Routine mem_predict() in patch 1 uses the best fit line to determine the immediate need for reclamation or compaction. To simplify the initial concept implementation, it uses a fixed time threshold for when compaction should start in anticipation of impending fragmentation. Similarly, it uses a fixed minimum percentage of free pages as the criterion to determine if it is time to start reclamation when the current trend line shows a continued drop in the number of free pages. Both of these criteria can be improved upon in the final implementation by taking the rate of compaction and the rate of reclamation into account.

Patch 2 collects data points for the best fit line in kswapd before we decide if kswapd should go to sleep or continue reclamation. It then uses that data to delay kswapd from sleeping and continue reclamation. Potential fragmentation information obtained from the best fit line is used to decide if the zone watermark should be boosted to avert impending fragmentation. This data is also used in balance_pgdat() to determine if kcompactd should be woken up to start compaction.

get_page_from_freelist() might be a better place to gather data points and make the decision on starting reclamation or compaction, but doing so can also impact page allocation latency. Another possibility is to create a separate kernel thread that gathers page usage data periodically and wakes up kswapd or kcompactd as needed based upon trend analysis. This is something that can be finalized before the final implementation of this proposal.

The impact of this implementation was measured using two sets of tests. The first test consists of three concurrent dd processes writing large amounts of data (66 GB, 131 GB and 262 GB) to three different SSDs, causing a large number of free pages to be used up for buffer/page cache. The cumulative number of allocation stalls as reported by /proc/vmstat was recorded for 5 runs of this test.
5.3-rc2
-------

allocstall_dma 0
allocstall_dma32 0
allocstall_normal 15
allocstall_movable 1629
compact_stall 0

Total = 1644


5.3-rc2 + this patch series
---------------------------

allocstall_dma 0
allocstall_dma32 0
allocstall_normal 182
allocstall_movable 1266
compact_stall 0

Total = 1544

There was no significant change in system time between these runs. This was a ~6.5% improvement in the number of allocation stalls.

The second test used was the parallel dd test from mmtests. The average number of stalls over 4 runs with the unpatched 5.3-rc2 kernel was 6057. The average number of stalls over 4 runs after applying these patches was 5584. This was an ~8% improvement in the number of allocation stalls.

This work is complementary to other allocation/compaction stall improvements. It attempts to address potential stalls proactively before they happen and will make use of any improvements made to the reclamation/compaction code.

Any feedback on this proposal and the associated implementation will be greatly appreciated. This is work in progress.

Khalid Aziz (2):
  mm: Add trend based prediction algorithm for memory usage
  mm/vmscan: Add fragmentation prediction to kswapd

 include/linux/mmzone.h |  72 +++++++++++
 mm/Makefile            |   2 +-
 mm/lsq.c               | 273 +++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |  27 ----
 mm/vmscan.c            | 116 ++++++++++++++++-
 5 files changed, 456 insertions(+), 34 deletions(-)
 create mode 100644 mm/lsq.c

--
2.20.1