Hi Richard,

This series follows up on your mail at
http://www.spinics.net/lists/kvm/msg80080.html . I am pleased the lock
contention is now reduced but acknowledge that the scanning rates are
stupidly high. Fortunately, I am reasonably confident I know what is
going wrong. If all goes according to plan, this should drastically
reduce the amount of time your workload spends on compaction. I would
very much appreciate it if you dropped the MM patches (i.e. keep the
btrfs patches) and replaced them with this series.

I know that Rik's patches are dropped and this is deliberate. I
reimplemented his idea on top of the fifth patch of this series to
cover both the migrate and free scanners. Thanks to Rik, who discussed
on IRC how the idea could be reimplemented, which was very helpful.
Hopefully the patch actually reflects what we discussed :)

Shaohua, I would also appreciate it if you tested this series. I picked
up one of your patches but replaced another and want to make sure that
the workload you were investigating is still ok.

===

Richard Davies and Shaohua Li have both reported lock contention
problems in compaction on the zone and LRU locks, as well as
significant amounts of time being spent in compaction. It is critical
that performance gains from THP are not offset by the cost of
allocating them in the first place. This series aims to reduce lock
contention and scanning rates.

Patch 1 is a fix for c67fe375 (mm: compaction: Abort async compaction
	if locks are contended or taking too long) to properly abort in
	all cases when contention is detected.

Patch 2 defers acquiring the zone->lru_lock as long as possible.

Patch 3 defers acquiring the zone->lock as long as possible.

Patch 4 reverts Rik's "skip-free" patches as the core concept gets
	reimplemented later and the remaining patches are easier to
	understand if this is reverted first.

Patch 5 adds a pageblock-skip bit to the pageblock flags to cache which
	pageblocks should be skipped by the migrate and free scanners.
	This drastically reduces the amount of scanning compaction has
	to do.

Patch 6 reimplements something similar to Rik's idea, except that it
	uses the pageblock-skip information to decide where the
	scanners should restart from and does not need to wrap around.

(Rough sketches of the ideas behind patches 1-3, 5 and 6 follow the
test results below.)

I tested this on 3.6-rc5 as that was the kernel base that the earlier
threads worked on. It will need a bit of work to rebase on top of
Andrew's tree for merging due to other compaction changes, but that
will not be a major problem.

Kernels tested were

vanilla		3.6-rc5
lesslock	Patches 1-3
revert		Patches 1-4
cachefail	Patches 1-5
skipuseless	Patches 1-6

Stress high-order allocation tests looked ok.

STRESS-HIGHALLOC
                    3.6.0           3.6.0-rc5       3.6.0-rc5       3.6.0-rc5       3.6.0-rc5
                    rc5-vanilla     lesslock        revert          cachefail       skipuseless
Pass 1          17.00 ( 0.00%)  19.00 ( 2.00%)  29.00 (12.00%)  24.00 ( 7.00%)  20.00 ( 3.00%)
Pass 2          16.00 ( 0.00%)  19.00 ( 3.00%)  39.00 (23.00%)  37.00 (21.00%)  35.00 (19.00%)
while Rested    88.00 ( 0.00%)  88.00 ( 0.00%)  88.00 ( 0.00%)  85.00 (-3.00%)  86.00 (-2.00%)

Success rates are improved a bit by the series as there are fewer
opportunities to race with other allocation requests if compaction is
scanning less. I recognise the success rates are still low, but patches
that tackle parts of that are in Andrew's tree already. The time to
complete the tests did not vary much and was uninteresting, as were the
vmstat statistics, so I will not present them here.
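As promised above, here is the shape of the locking change in patches
1-3. This is a minimal sketch rather than the code in the series;
compact_trylock() and the simplified struct compact_control are
illustrative names only.

#include <linux/spinlock.h>

/* Simplified stand-in for the real struct in mm/internal.h */
struct compact_control {
	unsigned long migrate_pfn;	/* migrate scanner position */
	unsigned long free_pfn;		/* free scanner position */
	bool sync;			/* synchronous compaction? */
	bool contended;			/* async compaction should abort */
};

/*
 * Take zone->lock or zone->lru_lock only at the point where pages are
 * actually isolated, instead of around the whole scan. Sync compaction
 * may wait on the lock; async compaction backs off as soon as the lock
 * is contended so that THP allocation latency stays bounded.
 */
static bool compact_trylock(spinlock_t *lock, unsigned long *flags,
			    struct compact_control *cc)
{
	if (cc->sync) {
		spin_lock_irqsave(lock, *flags);
		return true;
	}

	if (!spin_trylock_irqsave(lock, *flags)) {
		/* Record contention so all callers abort, not just one */
		cc->contended = true;
		return false;
	}

	return true;
}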
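The pageblock-skip bit from patch 5 slots into the existing pageblock
flags next to the migratetype bits, reusing the standard
get/set_pageblock_flags_group() machinery. Roughly like the following;
the exact names and bit layout in the patch may differ.

/*
 * In include/linux/pageblock-flags.h: one extra bit per pageblock,
 * recording that a scanner recently found nothing useful there.
 */
enum pageblock_bits {
	PB_migrate,
	PB_migrate_end = PB_migrate + 3 - 1,	/* 3 bits for migratetype */
	PB_migrate_skip,	/* scanners should skip this pageblock */
	NR_PAGEBLOCK_BITS
};

#define get_pageblock_skip(page) \
	get_pageblock_flags_group(page, PB_migrate_skip, PB_migrate_skip)
#define set_pageblock_skip(page) \
	set_pageblock_flags_group(page, 1, PB_migrate_skip, PB_migrate_skip)
#define clear_pageblock_skip(page) \
	set_pageblock_flags_group(page, 0, PB_migrate_skip, PB_migrate_skip)

Both scanners can then step over an entire pageblock with a single
flag test when the bit is set, and mark a pageblock when a full pass
over it isolated nothing, which is where the bulk of the scanning
reduction comes from.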
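Patch 6 then uses that information to decide where the scanners should
restart from, instead of wrapping around the zone as Rik's original
patches did. Again a sketch only: the compact_cached_* fields stand in
for per-zone state added by the patch and are not existing struct zone
members.

/*
 * Resume the scanners from cached per-zone positions rather than
 * rescanning from the zone boundaries on every compaction attempt.
 * struct compact_control is the simplified version from the first
 * sketch above.
 */
static void compact_init_scanners(struct zone *zone,
				  struct compact_control *cc)
{
	unsigned long start_pfn = zone->zone_start_pfn;
	unsigned long end_pfn = start_pfn + zone->spanned_pages;

	cc->migrate_pfn = zone->compact_cached_migrate_pfn;
	cc->free_pfn = zone->compact_cached_free_pfn;

	/* Once the scanners have met, reset to the zone boundaries */
	if (cc->migrate_pfn >= cc->free_pfn) {
		cc->migrate_pfn = start_pfn;
		cc->free_pfn = end_pfn & ~(pageblock_nr_pages - 1);
		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
		zone->compact_cached_free_pfn = cc->free_pfn;
	}
}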
Using ftrace I recorded how much scanning was done by compaction and
got this

                             3.6.0       3.6.0-rc5   3.6.0-rc5   3.6.0-rc5   3.6.0-rc5
                             rc5-vanilla lesslock    revert      cachefail   skipuseless
Total free scanned           185020625   223313210   744553485    37149462    29231432
Total free isolated             845094     1174759     4301672      906689      721963
Total free efficiency          0.0046%     0.0053%     0.0058%     0.0244%     0.0247%
Total migrate scanned        187708506   143133150   428180990    21941574    12288851
Total migrate isolated          714376     1081134     3950098      711357      590552
Total migrate efficiency       0.0038%     0.0076%     0.0092%     0.0324%     0.0481%

The efficiency is worthless because of the nature of the test and the
number of failures. The really interesting point as far as this patch
series is concerned is the number of pages scanned. Note that reverting
Rik's patches massively increases the number of pages scanned,
indicating that those patches really did make a huge difference to CPU
usage.

However, caching which pageblocks should be skipped has a much higher
impact. With patches 1-5 applied, free page scanning is reduced by 80%
in comparison to the vanilla kernel and migrate page scanning is
reduced by 88%. If the basic concept of Rik's patches is then
reimplemented on top, scanning is reduced even further: free scanning
by 84% and migrate scanning by 93%.

 include/linux/mmzone.h          |    5 +-
 include/linux/pageblock-flags.h |   19 +-
 mm/compaction.c                 |  407 ++++++++++++++++++++++++---------------
 mm/internal.h                   |   13 +-
 mm/page_alloc.c                 |    6 +-
 5 files changed, 284 insertions(+), 166 deletions(-)

-- 
1.7.9.2