Hi,

This is the third version of the patchset.

The first version was posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973 (lkml wasn't CCed at the time so I cannot find it in the lwn.net archives). There were no major objections. The second version was posted here http://lwn.net/Articles/548191/ as a part of a longer and spicier thread which started after LSF: https://lwn.net/Articles/548192/ So this is a back to trees repost which tries to keep only the guts of the original post.

Changes between RFC (aka V1) -> V2
As there were no major objections, there were only some minor cleanups since the last version and I have moved "memcg: Ignore soft limit until it is explicitly specified" to the end of the series.

Changes between V2 -> V3
No changes in the code since the last version. I have just rebased the series on top of the current mmotm tree. The most controversial part has been dropped (the last patch "memcg: Ignore soft limit until it is explicitly specified") so there are no semantic changes to the soft limit behavior. This makes the work mostly a code cleanup and reorganization. Nevertheless, it is enough to make the soft limit reclaim work more efficiently according to my testing, and groups above the soft limit are reclaimed much less as a result.

The basic idea is quite simple. The first step pulls soft reclaim into shrink_zone and gets rid of the previous soft reclaim infrastructure. shrink_zone is done in two passes now. First it tries to do the soft limit reclaim, and it falls back to reclaim-all mode if no group is over the limit or no pages have been scanned. The second pass happens at the same priority, so the only time we waste is the memcg tree walk which shouldn't be a big deal [1] (a rough pseudo-code sketch of this flow is appended at the end of this mail). There is certainly room for improvement in that direction (e.g. skip the soft reclaim pass if no limit has been specified for any group yet, or if we know that no group is below the limit). But let's keep it simple for now. As a bonus we get rid of a _lot_ of code and soft reclaim no longer stands out like before. The cleanup is in a separate patch because I felt it would be easier to review that way.

The second step is the integration of soft limit reclaim into targeted reclaim. It should be rather straightforward. Soft limit has been used only for global reclaim so far, but it makes sense for any kind of pressure coming from up the hierarchy, including targeted reclaim.

My primary test case was parallel kernel builds (one per cgroup, each make run with -j4 with a distribution .config and without any hard limit) on an 8 CPU machine booted with 1GB memory. I was mostly interested in 2 setups: default (no soft limit set) and a 0 soft limit set for both groups. The first one should tell us whether the rework regresses the default behavior while the second one should show us improvements in the extreme case where both workloads are always over the soft limit. /usr/bin/time -v has been used to collect the statistics and each configuration had 3 runs after a fresh boot, without any other load on the system.
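For reference, the 0-limit setup looked roughly like the sketch below. This is only an illustration rather than my actual scripts; it assumes a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory and two kernel trees in ~/linux-A and ~/linux-B (the no-limit runs are the same minus the soft limit writes):

  # two groups, no hard limit, soft limit set to 0
  mkdir /sys/fs/cgroup/memory/A /sys/fs/cgroup/memory/B
  echo 0 > /sys/fs/cgroup/memory/A/memory.soft_limit_in_bytes
  echo 0 > /sys/fs/cgroup/memory/B/memory.soft_limit_in_bytes

  # one build per group; $BASHPID moves each subshell (and therefore
  # make and all of its children) into its group
  (echo $BASHPID > /sys/fs/cgroup/memory/A/tasks && \
   cd ~/linux-A && /usr/bin/time -v make -j4) &
  (echo $BASHPID > /sys/fs/cgroup/memory/B/tasks && \
   cd ~/linux-B && /usr/bin/time -v make -j4) &
  wait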
* No-limit

Base kernel (before)
System:  min:238.26   max:240.87   avg:239.85   stdev:0.90
User:    min:1166.40  max:1173.79  avg:1169.92  stdev:2.41
Elapsed: min:09:38.06 max:09:57.96 avg:09:48.80 stdev:00:07.14

Rework (after)
System:  min:238.88[100.3%]  max:242.43[100.6%]   avg:240.70[100.4%] stdev:1.62
User:    min:1168.61[100.2%] max:1171.15[99.8%]   avg:1169.69[100%]  stdev:0.82
Elapsed: min:09:31.85[98.9%] max:09:57.74[100.0%] avg:09:47.57[99.8%] stdev:00:09.54

The numbers are within the stdev so I think we are doing well here.

* 0-limit

Base kernel (before)
System:  min:244.74   max:248.66   avg:246.39   stdev:1.20
User:    min:1190.51  max:1195.70  avg:1193.77  stdev:1.66
Elapsed: min:12:39.73 max:12:56.79 avg:12:48.33 stdev:00:06.59

Rework (after)
System:  min:238.64[97.5%]   max:242.34[97.5%]   avg:240.67[97.7%]  stdev:1.26
User:    min:1166.20[98%]    max:1170.98[98%]    avg:1168.17[97.9%] stdev:1.44
Elapsed: min:09:47.53[77.3%] max:09:59.05[77.1%] avg:09:53.23[77.2%] stdev:00:04.72

We can see a 2% decrease in both System and User time, which is not earth shattering but sounds like a good outcome for a cleanup ;). It is even more interesting to check the Elapsed time numbers, which show that the parallel load is handled much more efficiently. I haven't looked deeply into the specific reasons for this boost, but I would guess that the priority-0 reclaim done in the original implementation is a big contributor.

Page fault statistics tell at least part of the story:

Base
Major min:29686    max:34224    avg:31028.67    stdev:1503.11
Minor min:35919443 max:35980518 avg:35945936.50 stdev:21340.41

Rework
Major min:332[1.1%]     max:883[2.6%]     avg:478.67[1.5%]     stdev:187.46
Minor min:35583644[99%] max:35660073[99%] avg:35621868.67[99%] stdev:29240.64

While the minor faults are within the noise, the major faults are reduced considerably. This looks like aggressive pageout during reclaim, and that pageout presumably hits the working set. If we look at the same numbers for the no-limit case (i.e. no soft limit in action) we get numbers very similar to the 0-limit rework:

Base
Major min:208.00 max:454.00 avg:375.17 stdev:86.73
Rework
Major min:270.00 max:730.00 avg:431.67 stdev:143.58

So this clearly points at the priority-0 reclaim.

Shortlog says:

Michal Hocko (3):
      memcg: integrate soft reclaim tighter with zone shrinking code
      memcg: Get rid of soft-limit tree infrastructure
      vmscan, memcg: Do softlimit reclaim also for targeted reclaim

And the diffstat:

 include/linux/memcontrol.h |  12 +-
 mm/memcontrol.c            | 387 +++-----------------------------------------
 mm/vmscan.c                |  62 ++++---
 3 files changed, 65 insertions(+), 396 deletions(-)
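As promised above, here is a rough pseudo-C sketch of how the reworked shrink_zone is intended to flow. It is only a sketch of the idea; the helpers (for_each_memcg_in_hierarchy, soft_reclaim_eligible, shrink_group) are placeholders rather than the functions used in the patches:

  /*
   * Pseudo-C only; helper names are placeholders, see the patches
   * for the real code.
   */
  static void shrink_zone(struct zone *zone, struct scan_control *sc)
  {
          unsigned long nr_scanned = sc->nr_scanned;
          bool soft_reclaimed = false;
          struct mem_cgroup *memcg;

          /*
           * First pass: reclaim only from groups which are over their
           * soft limit (or have such a group up the hierarchy).
           */
          for_each_memcg_in_hierarchy(memcg, sc->target_mem_cgroup) {
                  if (!soft_reclaim_eligible(memcg, sc->target_mem_cgroup))
                          continue;
                  shrink_group(memcg, zone, sc);
                  soft_reclaimed = true;
          }

          /*
           * Second pass at the same priority: fall back to reclaim-all
           * mode if no group was over its limit or nothing was scanned
           * in the first pass.
           */
          if (!soft_reclaimed || sc->nr_scanned == nr_scanned)
                  for_each_memcg_in_hierarchy(memcg, sc->target_mem_cgroup)
                          shrink_group(memcg, zone, sc);
  }

Because the second pass reuses the same priority, an unsuccessful soft limit pass does not restart reclaim from scratch; the only extra work in the common case is the additional memcg tree walk.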