On Tue, 10 Jan 2012 16:02:52 +0100 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:

> Right now, memcg soft limits are implemented by having a sorted tree
> of memcgs that are in excess of their limits. Under global memory
> pressure, kswapd first reclaims from the biggest excessor and then
> proceeds to do regular global reclaim. The result of this is that
> pages are reclaimed from all memcgs, but more scanning happens against
> those above their soft limit.
>
> With global reclaim doing memcg-aware hierarchical reclaim by default,
> this is a lot easier to implement: every time a memcg is reclaimed
> from, scan more aggressively (per tradition, with a priority of 0) if
> it's above its soft limit. The end result is the same, everybody gets
> scanned, but soft limit excessors a bit more.
>
> Advantages:
>
> o smoother reclaim: soft limit reclaim is a separate stage before
>   global reclaim, whose result is not communicated down the line, so
>   overreclaim of the groups in excess is very likely. After this
>   patch, soft limit reclaim is fully integrated into regular reclaim
>   and each memcg is considered exactly once per cycle.
>
> o true hierarchy support: soft limits are only considered when
>   kswapd does global reclaim, but after this patch, targeted
>   reclaim of a memcg will mind the soft limit settings of its child
>   groups.
>
> o code size: soft limit reclaim requires a lot of code to maintain
>   the per-node per-zone rb-trees that quickly find the biggest
>   offender, dedicated paths for soft limit reclaim, etc., while this
>   new implementation gets away without all that.
>
> Test:
>
> The test consists of two concurrent kernel build jobs in separate
> source trees, the master and the slave. The two jobs get along nicely
> on 600MB of available memory, so this is the zero-overcommit control
> case. When available memory is decreased, the overcommit is
> compensated for by decreasing the soft limit of the slave by the same
> amount, in the hope that the slave takes the hit and the master stays
> unaffected.
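(Before my comments below: as I read it, the whole mechanism now boils
down to a priority override in the per-memcg reclaim loop, roughly like
the sketch here. mem_cgroup_over_soft_limit() and
shrink_mem_cgroup_zone() are stand-in names for my own understanding,
not necessarily what the patch actually uses, so please correct me if I
misread the code.)

/*
 * Sketch, not the patch itself: while walking the memcg hierarchy
 * during regular reclaim, drop the priority to 0 for any group that
 * is above its soft limit.
 */
static void shrink_zone(int priority, struct zone *zone,
			struct scan_control *sc)
{
	struct mem_cgroup *root = sc->target_mem_cgroup;
	struct mem_cgroup *memcg;

	for (memcg = mem_cgroup_iter(root, NULL, NULL); memcg;
	     memcg = mem_cgroup_iter(root, memcg, NULL)) {
		int epriority = priority;

		/*
		 * Soft limit excessors are scanned with priority 0:
		 * the scan target becomes the whole LRU list instead
		 * of lru_size >> priority.
		 */
		if (mem_cgroup_over_soft_limit(memcg))
			epriority = 0;

		shrink_mem_cgroup_zone(epriority, memcg, zone, sc);
	}
}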
>
>                                    600M-0M-vanilla       600M-0M-patched
> Master walltime (s)                 552.65 (  +0.00%)     552.38 (  -0.05%)
> Master walltime (stddev)              1.25 (  +0.00%)       0.92 ( -14.66%)
> Master major faults                 204.38 (  +0.00%)     205.38 (  +0.49%)
> Master major faults (stddev)         27.16 (  +0.00%)      13.80 ( -47.43%)
> Master reclaim                       31.88 (  +0.00%)      37.75 ( +17.87%)
> Master reclaim (stddev)              34.01 (  +0.00%)      75.88 (+119.59%)
> Master scan                          31.88 (  +0.00%)      37.75 ( +17.87%)
> Master scan (stddev)                 34.01 (  +0.00%)      75.88 (+119.59%)
> Master kswapd reclaim             33922.12 (  +0.00%)   33887.12 (  -0.10%)
> Master kswapd reclaim (stddev)      969.08 (  +0.00%)     492.22 ( -49.16%)
> Master kswapd scan                34085.75 (  +0.00%)   33985.75 (  -0.29%)
> Master kswapd scan (stddev)        1101.07 (  +0.00%)     563.33 ( -48.79%)
> Slave walltime (s)                  552.68 (  +0.00%)     552.12 (  -0.10%)
> Slave walltime (stddev)               0.79 (  +0.00%)       1.05 ( +14.76%)
> Slave major faults                  212.50 (  +0.00%)     204.50 (  -3.75%)
> Slave major faults (stddev)          26.90 (  +0.00%)      13.17 ( -49.20%)
> Slave reclaim                        26.12 (  +0.00%)      35.00 ( +32.72%)
> Slave reclaim (stddev)               29.42 (  +0.00%)      74.91 (+149.55%)
> Slave scan                           31.38 (  +0.00%)      35.00 ( +11.20%)
> Slave scan (stddev)                  33.31 (  +0.00%)      74.91 (+121.24%)
> Slave kswapd reclaim              34259.00 (  +0.00%)   33469.88 (  -2.30%)
> Slave kswapd reclaim (stddev)       925.15 (  +0.00%)     565.07 ( -38.88%)
> Slave kswapd scan                 34354.62 (  +0.00%)   33555.75 (  -2.33%)
> Slave kswapd scan (stddev)          969.62 (  +0.00%)     581.70 ( -39.97%)
>
> In the control case, the differences in elapsed time, number of major
> faults taken, and reclaim statistics are within the noise for both the
> master and the slave job.
>
>                                  600M-280M-vanilla     600M-280M-patched
> Master walltime (s)                 595.13 (  +0.00%)     553.19 (  -7.04%)
> Master walltime (stddev)              8.31 (  +0.00%)       2.57 ( -61.64%)
> Master major faults                3729.75 (  +0.00%)     783.25 ( -78.98%)
> Master major faults (stddev)        258.79 (  +0.00%)     226.68 ( -12.36%)
> Master reclaim                      705.00 (  +0.00%)      29.50 ( -95.68%)
> Master reclaim (stddev)             232.87 (  +0.00%)      44.72 ( -80.45%)
> Master scan                         714.88 (  +0.00%)      30.00 ( -95.67%)
> Master scan (stddev)                237.44 (  +0.00%)      45.39 ( -80.54%)
> Master kswapd reclaim               114.75 (  +0.00%)      50.00 ( -55.94%)
> Master kswapd reclaim (stddev)      128.51 (  +0.00%)       9.45 ( -91.93%)
> Master kswapd scan                  115.75 (  +0.00%)      50.00 ( -56.32%)
> Master kswapd scan (stddev)         130.31 (  +0.00%)       9.45 ( -92.04%)
> Slave walltime (s)                  631.18 (  +0.00%)     577.68 (  -8.46%)
> Slave walltime (stddev)               9.89 (  +0.00%)       3.63 ( -57.47%)
> Slave major faults                28401.75 (  +0.00%)   14656.75 ( -48.39%)
> Slave major faults (stddev)        2629.97 (  +0.00%)    1911.81 ( -27.30%)
> Slave reclaim                     65400.62 (  +0.00%)    1479.62 ( -97.74%)
> Slave reclaim (stddev)            11623.02 (  +0.00%)    1482.13 ( -87.24%)
> Slave scan                      9050047.88 (  +0.00%)   95968.25 ( -98.94%)
> Slave scan (stddev)             1912786.94 (  +0.00%)   93390.71 ( -95.12%)
> Slave kswapd reclaim             327894.50 (  +0.00%)  227099.88 ( -30.74%)
> Slave kswapd reclaim (stddev)     22289.43 (  +0.00%)   16113.14 ( -27.71%)
> Slave kswapd scan              34987335.75 (  +0.00%) 1362367.12 ( -96.11%)
> Slave kswapd scan (stddev)      2523642.98 (  +0.00%)  156754.74 ( -93.79%)
>
> Here, the available memory is limited to 320 MB, so the machine is
> overcommitted by 280 MB. The soft limit of the master is 300 MB, that
> of the slave merely 20 MB.
>
> Looking at the slave job first, it is much better off with the patched
> kernel: direct reclaim is almost gone and kswapd reclaim is decreased
> by a third. The result is far fewer major faults taken, which in turn
> lets the job finish quicker.
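(Side note in case somebody wants to reproduce the test: with the v1
memory controller, the soft limits above are plain writes to
memory.soft_limit_in_bytes. The snippet below is only an illustration;
it assumes the controller is mounted at /sys/fs/cgroup/memory and that
the "master" and "slave" groups already exist.)

#include <stdio.h>
#include <stdlib.h>

/* Write a byte value into a group's memory.soft_limit_in_bytes. */
static void set_soft_limit(const char *group, const char *bytes)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/memory/%s/memory.soft_limit_in_bytes",
		 group);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", bytes);
	fclose(f);
}

int main(void)
{
	set_soft_limit("master", "314572800");	/* 300 MB */
	set_soft_limit("slave",  "20971520");	/*  20 MB */
	return 0;
}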
>
> It would be a zero-sum game if the improvement happened at the cost of
> the master, but looking at the numbers, even the master performs better
> with the patched kernel. In fact, the master job is almost unaffected
> on the patched kernel compared to the control case.
>
> This is an odd phenomenon, as the patch does not directly change how
> the master is reclaimed. An explanation is that the severe overreclaim
> of the slave in the unpatched kernel results in the master growing
> bigger than in the patched case. Combine the fact that memcgs are
> scanned according to their size with the increased refault rate of the
> overreclaimed slave triggering global reclaim more often, and overall
> pressure on the master job is higher in the unpatched kernel.
>
> At any rate, the patched kernel seems to do a much better job at both
> overall resource allocation under soft limit overcommit and the
> requested prioritization of the master job.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>

Thank you for your work. The result seems attractive and the code is
much simpler. My small concerns are:

1. This approach may increase the latency of direct reclaim because of
   priority=0.

2. When a NUMA-spread/interleaved application runs in its own
   container, pages on a node may be paged out again and again because
   of priority=0 if some other application runs on that node. It seems
   difficult to use soft limits with NUMA-aware applications.

Do you have suggestions?

Thanks,
-Kame
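P.S. To put a number on concern 1: if I remember get_scan_count()
correctly, the per-pass scan target is roughly lru_size >> priority, so
priority=0 considers the entire list in a single pass. A userspace toy
to show the difference (DEF_PRIORITY is 12 in the kernel):

#include <stdio.h>

/* Simplified from the kernel's get_scan_count(): the number of pages
 * considered per pass is the LRU size shifted right by the reclaim
 * priority. */
static unsigned long scan_target(unsigned long lru_size, int priority)
{
	return lru_size >> priority;
}

int main(void)
{
	unsigned long lru_size = 1UL << 20;	/* 1M pages, ~4GB */

	printf("priority 12: %lu pages\n", scan_target(lru_size, 12));
	printf("priority  0: %lu pages\n", scan_target(lru_size, 0));
	return 0;
}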