On Mon, Sep 21, 2015 at 11:52:39AM +0100, Mel Gorman wrote:
> The zonelist cache (zlc) was introduced to skip over zones that were
> recently known to be full. This avoided expensive operations such as the
> cpuset checks, watermark calculations and zone_reclaim. The situation
> today is different and the complexity of zlc is harder to justify.
>
> 1) The cpuset checks are no-ops unless a cpuset is active and in general
>    are a lot cheaper.
>
> 2) zone_reclaim is now disabled by default and I suspect that was a large
>    source of the cost that zlc wanted to avoid. When it is enabled, it's
>    known to be a major source of stalling when nodes fill up and it's
>    unwise to hit every other user with the overhead.
>
> 3) Watermark checks are expensive to calculate for high-order
>    allocation requests. Later patches in this series will reduce the cost
>    of the watermark checking.
>
> 4) The most important issue is that in the current implementation it
>    is possible for a failed THP allocation to mark a zone full for order-0
>    allocations and cause a fallback to remote nodes.
>
> The last issue could be addressed with additional complexity but as the
> benefit of zlc is questionable, it is better to remove it. If stalls
> due to zone_reclaim are ever reported then an alternative would be to
> introduce deferring logic based on a timeout inside zone_reclaim itself
> and leave the page allocator fast paths alone.
>
> The impact on page-allocator microbenchmarks is negligible as they don't
> hit the paths where the zlc comes into play. Most page-reclaim related
> workloads showed no noticeable difference as a result of the removal.
>
> The impact was noticeable in a workload called "stutter". One part uses a
> lot of anonymous memory, a second measures mmap latency and a third copies
> a large file. In an ideal world the latency application would not notice
> the mmap latency.
> On a 2-node machine the results of this patch are
>
> stutter
>                               4.3.0-rc1             4.3.0-rc1
>                                baseline              nozlc-v4
> Min          mmap     20.9243 (  0.00%)     20.7716 (  0.73%)
> 1st-qrtle    mmap     22.0612 (  0.00%)     22.0680 ( -0.03%)
> 2nd-qrtle    mmap     22.3291 (  0.00%)     22.3809 ( -0.23%)
> 3rd-qrtle    mmap     25.2244 (  0.00%)     25.2396 ( -0.06%)
> Max-90%      mmap     48.0995 (  0.00%)     28.3713 ( 41.02%)
> Max-93%      mmap     52.5557 (  0.00%)     36.0170 ( 31.47%)
> Max-95%      mmap     55.8173 (  0.00%)     47.3163 ( 15.23%)
> Max-99%      mmap     67.3781 (  0.00%)     70.1140 ( -4.06%)
> Max          mmap  24447.6375 (  0.00%)  12915.1356 ( 47.17%)
> Mean         mmap     33.7883 (  0.00%)     27.7944 ( 17.74%)
> Best99%Mean  mmap     27.7825 (  0.00%)     25.2767 (  9.02%)
> Best95%Mean  mmap     26.3912 (  0.00%)     23.7994 (  9.82%)
> Best90%Mean  mmap     24.9886 (  0.00%)     23.2251 (  7.06%)
> Best50%Mean  mmap     22.0157 (  0.00%)     22.0261 ( -0.05%)
> Best10%Mean  mmap     21.6705 (  0.00%)     21.6083 (  0.29%)
> Best5%Mean   mmap     21.5581 (  0.00%)     21.4611 (  0.45%)
> Best1%Mean   mmap     21.3079 (  0.00%)     21.1631 (  0.68%)
>
> Note that the maximum stall latency went from 24 seconds to 12 which is still
> bad but an improvement. The mileage varies considerably: a 2-node machine on an
> earlier test went from 494 seconds to 47 seconds and a 4-node machine that
> tested an earlier version of this patch went from a worst case stall time of
> 6 seconds to 67ms. The nature of the benchmark is inherently unpredictable
> as it is hammering the system and the mileage will vary between machines.
>
> There is a secondary impact with potentially more direct reclaim because
> zones are now being considered instead of being skipped by zlc. In this
> particular test run it did not occur so will not be described. However,
> in at least one test the following was observed
>
> 1. Direct reclaim rates were higher. This was likely due to direct reclaim
>    being entered instead of the zlc disabling a zone and busy looping.
>    Busy looping may have the effect of allowing kswapd to make more
>    progress and in some cases may be better overall.
>    If this is found then the correct action is to put direct reclaimers to
>    sleep on a waitqueue and allow kswapd to make forward progress. Busy
>    looping on the zlc is even worse than when the allocator used to blindly
>    call congestion_wait().
>
> 2. There was higher swap activity as direct reclaim was active.
>
> 3. Direct reclaim efficiency was lower. This is related to 1 as more
>    scanning activity also encountered more pages that could not be
>    immediately reclaimed.
>
> In that case, the direct page scan and reclaim rates are noticeable but
> it is not considered a problem for a few reasons
>
> 1. The test is primarily concerned with latency. The mmap attempts are also
>    faulted which means there are THP allocation requests. The ZLC could
>    cause zones to be disabled causing the process to busy loop instead
>    of reclaiming. This looks like elevated direct reclaim activity but
>    it's the correct action to take based on what processes requested.
>
> 2. The test hammers reclaim and compaction heavily. The number of successful
>    THP faults is highly variable but affects the reclaim stats. It's not a
>    realistic or reasonable measure of page reclaim activity.
>
> 3. No other page-reclaim intensive workload that was tested showed a problem.
>
> 4. If a workload is identified that benefitted from the busy looping then it
>    should be fixed by having direct reclaimers sleep on a wait queue until
>    woken by kswapd instead of busy looping. We had this class of problem before
>    when congestion_wait() with a fixed timeout was a brain damaged decision
>    but happened to benefit some workloads.
>
> If a workload is identified that relied on the zlc to busy loop then it
> should be fixed correctly and have a direct reclaimer sleep on a waitqueue
> until woken by kswapd.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Acked-by: David Rientjes <rientjes@xxxxxxxxxx>
> Acked-by: Christoph Lameter <cl@xxxxxxxxx>
> Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
> Acked-by: Michal Hocko <mhocko@xxxxxxxx>
> ---
>  include/linux/mmzone.h |  74 -----------------
>  mm/page_alloc.c        | 212 -------------------------------------------------
>  2 files changed, 286 deletions(-)

This patch and its results look great! And I agree, should this affect
the balance between kswapd and direct reclaim, it should be fixed
explicitly and not rely on something as unrelated as the zonelist
cache.

Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
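
For anyone re-checking the quoted stutter table: the bracketed percentages look like the relative latency reduction against the baseline kernel, i.e. (baseline - patched) / baseline * 100, with negative values meaning nozlc-v4 was slower. That formula is inferred from the numbers and is not stated anywhere in the thread. The helper name gain_percent is made up for illustration. A minimal sketch under that assumption:

```python
def gain_percent(baseline: float, patched: float) -> float:
    """Relative latency reduction versus baseline, in percent.

    Positive means the patched kernel had lower latency.
    Assumed formula, inferred from the table rather than documented.
    """
    return (baseline - patched) / baseline * 100.0


# (baseline, nozlc-v4) mmap latencies copied from the quoted table
rows = {
    "Max-90%": (48.0995, 28.3713),
    "Max-99%": (67.3781, 70.1140),
    "Max": (24447.6375, 12915.1356),
    "Mean": (33.7883, 27.7944),
}

for name, (base, patched) in rows.items():
    print(f"{name:<8} {gain_percent(base, patched):6.2f}%")
```

Rounded to two decimals this reproduces the table's figures for these rows, including the negative entry for Max-99%, which suggests the regression at the 99th percentile is real data and not a transcription slip.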