On Wed, Oct 08, 2014 at 01:47:25PM -0400, Johannes Weiner wrote: > On Wed, Oct 08, 2014 at 05:33:29PM +0200, Michal Hocko wrote: > > On Tue 07-10-14 21:11:06, Johannes Weiner wrote: > > > On Tue, Oct 07, 2014 at 03:59:50PM +0200, Michal Hocko wrote: > > > > Another part that matters is the size. Memcgs might be really small and > > > > that changes the math. Large reclaim target will get to low prio reclaim > > > > and thus the excessive reclaim. > > > > > > I already addressed page size vs. memcg size before. > > > > > > However, low priority reclaim does not result in excessive reclaim. > > > The reclaim goal is checked every time it scanned SWAP_CLUSTER_MAX > > > pages, and it exits if the goal has been met. See shrink_lruvec(), > > > shrink_zone() etc. > > > > Now I am confused. shrink_zone will bail out but shrink_lruvec will loop > > over nr[...] until they are empty and only updates the numbers to be > > roughly proportional once: > > > > if (nr_reclaimed < nr_to_reclaim || scan_adjusted) > > continue; > > > > /* > > * For kswapd and memcg, reclaim at least the number of pages > > * requested. Ensure that the anon and file LRUs are scanned > > * proportionally what was requested by get_scan_count(). We > > * stop reclaiming one LRU and reduce the amount scanning > > * proportional to the original scan target. > > */ > > [...] > > scan_adjusted = true; > > > > Or do you rely on > > /* > > * It's just vindictive to attack the larger once the smaller > > * has gone to zero. And given the way we stop scanning the > > * smaller below, this makes sure that we only make one nudge > > * towards proportionality once we've got nr_to_reclaim. > > */ > > if (!nr_file || !nr_anon) > > break; > > > > and SCAN_FILE because !inactive_file_is_low? > > That function is indeed quite confusing. > > Once nr_to_reclaim has been met, it looks at both LRUs and decides > which one has the smaller scan target left, sets it to 0, and then > scales the bigger target in proportion - that means if it scanned 10% > of nr[file], it sets nr[anon] to 10% of its original size, minus the > anon pages it already scanned. Based on that alone, overscanning is > limited to twice the requested size, i.e. 4MB for a 2MB THP page, > regardless of how low the priority drops. Sorry, this conclusion is incorrect. The proportionality code can indeed lead to more overreclaim than that, although I think this is actually not intended: the comment says "this makes sure we only make one nudge towards proportionality once we've got nr_to_reclaim," but once scan_adjusted we never actually check anymore. We we can end up making a lot more nudges toward proportionality. However, the following still applies, so it shouldn't matter: > In practice, the VM is heavily biased to avoid swapping. The forceful > SCAN_FILE you point out is one condition that avoids the proportional > scan most of the time. But even the proportional scan is heavily > biased towards cache - every cache insertion decreases the file > recent_rotated/recent_scanned ratio, whereas anon faults do not. > > On top of that, anon pages start out on the active list, whereas cache > starts on the inactive, which means that the majority of the anon scan > target - should there be one - usually translates to deactivation. > > So in most cases, I'd expect the scanner to bail after nr_to_reclaim > cache pages, and in low cache situations it might scan up to 2MB worth > of anon pages, a small amount of which it might swap. > > I don't particularly like the decisions the current code makes, but it > should work. We have put in a lot of effort to prevent overreclaim in > the last few years, and a big part of this was decoupling the priority > level from the actual reclaim results. Nowadays, the priority level > should merely dictate the scan window and have no impact on the number > of pages actually reclaimed. I don't expect that it will, but if it > does, that's a reclaim bug that needs to be addressed. If we ask for > N pages, it should never reclaim significantly more than that, > regardless of how aggressively it has to scan to accomplish that. That being said, I don't get the rationale behind the proportionality code in shrink_lruvec(). The patch that introduced it - e82e0561dae9 ("mm: vmscan: obey proportional scanning requirements for kswapd") - mentions respecting swappiness, but as per above we ignore swappiness anyway until we run low on cache and into actual pressure. And under pressure, once we struggle to reclaim nr_to_reclaim, proportionality enforces itself when one LRU type target hits zero and we continue to scan the one for which more pressure was allocated. But as long as we scan both lists at the same SWAP_CLUSTER_MAX rate and have no problem getting nr_to_reclaim pages with left-over todo for *both* LRU types, what's the point of going on? The justification for enforcing proportionality in direct reclaim is particularly puzzling: --- commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813 Author: Mel Gorman <mgorman@xxxxxxx> Date: Wed Jun 4 16:10:49 2014 -0700 mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY [...] 3.15.0-rc5 3.15.0-rc5 shrinker proportion Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%) Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%) Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%) Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%) Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%) Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%) The test cases are running multiple dd instances reading sparse files. The results are within the noise for the small test machine. The impact of the patch is more noticable from the vmstats 3.15.0-rc5 3.15.0-rc5 shrinker proportion Minor Faults 35154 36784 Major Faults 611 1305 Swap Ins 394 1651 Swap Outs 4394 5891 Allocation stalls 118616 44781 Direct pages scanned 4935171 4602313 Kswapd pages scanned 15921292 16258483 Kswapd pages reclaimed 15913301 16248305 Direct pages reclaimed 4933368 4601133 Kswapd efficiency 99% 99% Kswapd velocity 670088.047 682555.961 Direct efficiency 99% 99% Direct velocity 207709.217 193212.133 Percentage direct scans 23% 22% Page writes by reclaim 4858.000 6232.000 Page writes file 464 341 Page writes anon 4394 5891 Note that there are fewer allocation stalls even though the amount of direct reclaim scanning is very approximately the same. --- The timings show nothing useful, but the statistics strongly speak *against* this patch. Sure, direct reclaim invocations are reduced, but everything else worsened: minor faults increased, major faults doubled(!), swapins quadrupled(!!), swapins increased, pages scanned increased, pages reclaimed increased, reclaim page writes increased. If direct reclaim is invoked at that rate, kswapd is failing at its job, and the solution shouldn't be to overscan in direct reclaim. On the other hand, multi-threaded sparse readers are kind of expected to overwhelm a single kswapd worker, I'm not sure we should be tuning allocation latency to such a workload in the first place. Mel, do you maybe remember details that are not in the changelogs? Because based on them alone, I think we should look at other ways to ensure we scan with the right amount of vigor... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>