(cc'ing active people in the thread "[patch 68/92] mm: forbid lumpy-reclaim in shrink_active_list()")

In the interest of keeping my fingers from the flames at LSF/MM, I'm releasing an RFC for lumpy reclaim removal. The first patch removes lumpy reclaim itself and the second removes reclaim_mode_t. They can be merged together but the resulting patch is harder to review. The patches are based on commit e22057c8599373e5caef0bc42bdb95d2a361ab0d, which is after Andrew's tree was merged but before 3.4-rc1 is released.

Roughly 1K of text and over 200 lines of code are removed, and struct scan_control is smaller.

      text     data      bss      dec    hex filename
   6723455  1931304  2260992 10915751 a68fa7 vmlinux-3.3.0-git
   6722303  1931304  2260992 10914599 a68b27 vmlinux-3.3.0-lumpyremove-v1

There are behaviour changes caused by the series; the details are in the patches themselves. I ran some preliminary tests but coverage is shaky due to time constraints. The kernels tested were

3.2.0              Vanilla 3.2.0 kernel
3.3.0-git          Commit e22057c which will be part of 3.4-rc1
3.3.0-lumpyremove  These two patches

fs-mark running in a threaded configuration showed nothing useful.

postmark had interesting results. I know postmark is not very useful as a mail server benchmark but it pushes page reclaim in a manner that is useful from a testing perspective. Regressions in page reclaim can show up as regressions in postmark when the postmark working set size (WSS) is larger than physical memory.

POSTMARK
                                      3.2.0-vanilla          3.3.0-git   lumpyremove-v1r3
Transactions per second:             16.00 ( 0.00%)    19.00 ( 18.75%)    19.00 ( 18.75%)
Data megabytes read per second:      18.62 ( 0.00%)    23.18 ( 24.49%)    22.56 ( 21.16%)
Data megabytes written per second:   35.49 ( 0.00%)    44.18 ( 24.49%)    42.99 ( 21.13%)
Files created alone per second:      26.00 ( 0.00%)    35.00 ( 34.62%)    34.00 ( 30.77%)
Files create/transact per second:     8.00 ( 0.00%)     9.00 ( 12.50%)     9.00 ( 12.50%)
Files deleted alone per second:     680.00 ( 0.00%)  6124.00 (800.59%)  2041.00 (200.15%)
Files delete/transact per second:     8.00 ( 0.00%)     9.00 ( 12.50%)     9.00 ( 12.50%)

MMTests Statistics: duration
Sys Time Running Test (seconds)        119.61    111.16    111.40
User+Sys Time Running Test (seconds)   153.19    144.13    143.29
Total Elapsed Time (seconds)          1171.34    940.97    966.97

MMTests Statistics: vmstat
Page Ins                             13797412  13734736  13731792
Page Outs                            43284036  42959856  42744668
Swap Ins                                 7751         0         0
Swap Outs                                9617         0         0
Direct pages scanned                   334395         0         0
Kswapd pages scanned                  9664358   9933599   9929577
Kswapd pages reclaimed                9621893   9932913   9928893
Direct pages reclaimed                 334395         0         0
Kswapd efficiency                         99%       99%       99%
Kswapd velocity                      8250.686 10556.765 10268.754
Direct efficiency                        100%      100%      100%
Direct velocity                       285.481     0.000     0.000
Percentage direct scans                    3%        0%        0%
Page writes by reclaim                   9619         0         0
Page writes file                            2         0         0
Page writes anon                         9617         0         0
Page reclaim immediate                      7         0         0
Page rescued immediate                      0         0         0
Slabs scanned                           38912     38912     38912
Direct inode steals                         0         0         0
Kswapd inode steals                    154304    160972    158444
Kswapd skipped wait                         0         0         0
THP fault alloc                             4         4         4
THP collapse alloc                          0         0         0
THP splits                                  3         0         0
THP fault fallback                          0         0         0
THP collapse fail                           0         0         0
Compaction stalls                           1         0         0
Compaction success                          1         0         0
Compaction failures                         0         0         0
Compaction pages moved                      0         0         0
Compaction move failure                     0         0         0

It looks like 3.3.0-git is better in general, although that "Files deleted alone per second" figure looks like an anomaly. Removing lumpy reclaim affects the figures a bit but not enough to be of concern, particularly as monitoring was running at the same time, which disturbs results.
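For anyone not used to the MMTests summaries, the "efficiency", "velocity" and "percentage" lines are derived from the raw counters above them. Roughly, checked against the 3.2.0-vanilla column (illustrative C, not the MMTests reporting code itself):

	#include <stdio.h>

	int main(void)
	{
		/* Raw counters from the 3.2.0-vanilla column above */
		unsigned long kswapd_scanned   = 9664358;	/* Kswapd pages scanned */
		unsigned long kswapd_reclaimed = 9621893;	/* Kswapd pages reclaimed */
		unsigned long direct_scanned   = 334395;	/* Direct pages scanned */
		double elapsed = 1171.34;			/* Total Elapsed Time (seconds) */

		/* Efficiency: proportion of scanned pages that were reclaimed */
		printf("kswapd efficiency %.1f%%\n",
		       kswapd_reclaimed * 100.0 / kswapd_scanned);	/* 99.6, reported as 99% */

		/* Velocity: pages scanned per second of elapsed test time */
		printf("kswapd velocity   %.3f\n",
		       kswapd_scanned / elapsed);			/* 8250.686 */

		/* Share of all scanning done from the direct reclaim path */
		printf("direct scans      %.1f%%\n",
		       direct_scanned * 100.0 /
		       (direct_scanned + kswapd_scanned));		/* 3.3, reported as 3% */

		return 0;
	}

The figures to watch are efficiency collapsing or velocity spiking, as happens in the high-order stress results further down.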
Dirty pages were not being encountered at the end of the LRU, so the behaviour change related to THP allocations stalling on dirty pages would not be triggered.

Note that swap in/out, direct reclaim and page writes from reclaim dropped to 0 between 3.2.0 and 3.3.0-git. According to a range of results I have for mainline kernels between 2.6.32 and 3.3.0 on a different machine, this swap in/out and direct reclaim problem was introduced after 3.0 and fixed by 3.3.0, with 3.1.x and 3.2.x both showing swap in/out, direct reclaim and page writes from reclaim. If I had to guess, it was fixed by commits e0887c19, fe4b1b24 and 0cee34fd, but I did not double check[1]. Removing lumpy reclaim does not make an obvious difference here, but note that THP was barely used at all in this benchmark. Benchmarks that stress both page reclaim and THP at the same time in a meaningful manner are thin on the ground.

A benchmark that uses dd to write a large file also showed nothing interesting, but I was not really expecting it to. That test looks for problems related to a large linear writer and removing lumpy reclaim was unlikely to affect it.

I ran a benchmark that stresses high-order allocation. This is a very artificial load but it was used in the past to evaluate lumpy reclaim and compaction. Generally I look at allocation success rates and latency figures.

STRESS-HIGHALLOC
                3.2.0-vanilla          3.3.0-git   lumpyremove-v1r3
Pass 1         82.00 (  0.00%)   27.00 (-55.00%)   32.00 (-50.00%)
Pass 2         70.00 (  0.00%)   37.00 (-33.00%)   40.00 (-30.00%)
while Rested   90.00 (  0.00%)   88.00 ( -2.00%)   88.00 ( -2.00%)

MMTests Statistics: duration
Sys Time Running Test (seconds)        735.12    688.13    683.91
User+Sys Time Running Test (seconds)  2764.46   3278.45   3271.41
Total Elapsed Time (seconds)          1204.41   1140.29   1137.58

MMTests Statistics: vmstat
Page Ins                              5426648   2840348   2695120
Page Outs                             7206376   7854516   7860408
Swap Ins                                36799         0         0
Swap Outs                               76903         4         0
Direct pages scanned                    31981     43749    160647
Kswapd pages scanned                 26658682   1285341   1195956
Kswapd pages reclaimed                2248583   1271621   1178420
Direct pages reclaimed                   6397     14416     94093
Kswapd efficiency                          8%       98%       98%
Kswapd velocity                     22134.225  1127.205  1051.316
Direct efficiency                         20%       32%       58%
Direct velocity                        26.553    38.367   141.218
Percentage direct scans                    0%        3%       11%
Page writes by reclaim                6530481         4         0
Page writes file                      6453578         0         0
Page writes anon                        76903         4         0
Page reclaim immediate                 256742     17832     61576
Page rescued immediate                      0         0         0
Slabs scanned                         1073152    971776    975872
Direct inode steals                         0    196279    205178
Kswapd inode steals                    139260     70390     64323
Kswapd skipped wait                     21711         1         0
THP fault alloc                             1       126       143
THP collapse alloc                        324       294       224
THP splits                                 32         8        10
THP fault fallback                          0         0         0
THP collapse fail                           5         6         7
Compaction stalls                         364      1312      1324
Compaction success                        255       343       366
Compaction failures                       109       969       958
Compaction pages moved                 265107   3952630   4489215
Compaction move failure                  7493     26038     24739

Success rates are completely hosed for 3.4-rc1, which is almost certainly due to [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. I expected this would happen for kswapd and impair allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did not anticipate this much of a difference: 95% less scanning and 43% less reclaim by kswapd. In comparison, reclaim/compaction is not aggressive and gives up easily, which is the intended behaviour. hugetlbfs uses __GFP_REPEAT and would be much more aggressive about reclaim/compaction than THP allocations are. The stress test above allocates like neither THP nor hugetlbfs, but it is much closer to THP.
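To put that comparison in terms of flags, the difference is in how hard each caller asks the allocator to try. The snippet below is only an illustration of that distinction, not the exact masks used by hugetlbfs or THP; check gfp.h and the real callers before quoting it.

	#include <linux/gfp.h>

	/* Illustration only: contrast in allocation aggression for high orders */
	static struct page *hugetlb_style_alloc(unsigned int order)
	{
		/* __GFP_REPEAT: keep retrying reclaim/compaction before failing */
		return alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP |
				   __GFP_REPEAT | __GFP_NOWARN, order);
	}

	static struct page *thp_style_alloc(unsigned int order)
	{
		/* __GFP_NORETRY: give up quickly rather than stall the fault path */
		return alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP |
				   __GFP_NORETRY | __GFP_NOWARN, order);
	}

A __GFP_REPEAT caller will push reclaim/compaction much harder than the figures above suggest.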
Mainline is now impaired in terms of high-order allocation under heavy load, although I do not know to what degree as I did not test with __GFP_REPEAT. Still, keep it in mind for bugs related to hugepage pool resizing, THP allocation and high-order atomic allocation failures from network devices.

Despite this, I think we should merge the patches in this series. The stress tests were very useful when the main user was hugetlb pool resizing and for rattling out bugs in memory compaction, but they are now too unrealistic to draw solid conclusions from. They need to be replaced, but that should not delay the removal of lumpy reclaim.

I'd appreciate it if people took a look at the patches and see if there is anything I missed.

[1] Where are these results, you ask? They are generated using MM Tests to see what negative trends can be identified. They are still in the process of running and I've had limited time to dig through the data.

 include/trace/events/vmscan.h |   40 ++-----
 mm/vmscan.c                   |  263 ++-------------------------------------
 2 files changed, 37 insertions(+), 266 deletions(-)

-- 
1.7.9.2